
I have some text like this:

ENTITY first
    VHDL language standard: 3 (VHDL-2008)
  ARCHITECTURE BODY arch
    VHDL language standard: 3 (VHDL-2008)

Now I want a regexp that captures only the contents of the first parentheses after ENTITY, so the result should be VHDL-2008 or, even better, just 2008.

I'm new to regexps. What I tried:

"^ENTITY *(.*)"

only returns "first". So my question is: how can I match the newline after "first"? My attempt:

"^ENTITY .*\\n(.*)"

Also very confusing was the result of

"^(.*)"

which added some { and } to the output. Why?

I have found a very ugly way to do this. First, eliminate the newlines:

set data [regsub -all "\n" $data ""]

and then use something like this:

{ENTITY risc .*VHDL language standard: [0-3]..VHDL-(.*).}

As you can see, I didn't understand how to match a literal { or ( character. Is there a better solution?

Sadık

2 Answers

4

Assuming your expression is stored as a single string, you don't have to do anything special to accommodate newlines: the regexp man page says "By default, newline is a completely ordinary character with no special meaning."
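As a rough sketch of what that means in practice (the variable names are only for illustration, and this assumes the whole text really is one string): by default `.` matches across the newline, while the `-line` switch makes it stop at the end of the line.

```tcl
set str "ENTITY first\n    VHDL language standard: 3 (VHDL-2008)"

# Default: newline is an ordinary character, so the greedy .* runs
# past the end of the first line and captures the rest of the string.
regexp {^ENTITY (.*)} $str -> rest
puts [string match "*\n*" $rest]

# With -line, "." no longer matches a newline (and ^ and $ match at
# line boundaries), so only the remainder of the first line is captured.
regexp -line {^ENTITY (.*)} $str -> name
puts $name
```

If a pattern like this captures only the first line without `-line`, that is a hint that the data is not actually a single string.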

To match the contents of the first set of parentheses, you can do:

% set str {ENTITY first
    VHDL language standard: 3 (VHDL-2008)
  ARCHITECTURE BODY arch
    VHDL language standard: 3 (VHDL-2008)}
% regexp {^ENTITY[^(]+\(([^)]+)} $str -> vhdl
1
% puts $vhdl
VHDL-2008
% # or use non-greedy matching
% regexp {^ENTITY.+?\((.+?)\)} $str -> vhdl
1
% puts $vhdl
VHDL-2008
glenn jackman
  • And if the whole thing _isn't_ in a single string, you'll find getting a RE to match it very hard indeed. – Donal Fellows Jan 14 '14 at 15:10
  • thank you. I'll take regexp '{^ENTITY[^(]+\(VHDL-([^)]+)} $str -> vhdl' , to get only the number. – Sadık Jan 14 '14 at 17:37
  • That won't work because the open parenthesis before VHDL is not escaped. If you only want the number, use `regexp {^ENTITY[^(]+\(VHDL-(\d+)} $str -> vhdlnum` – glenn jackman Jan 14 '14 at 17:41
  • Well, it worked, but with \d is even better I guess. Thank you – Sadık Jan 14 '14 at 19:43
1

(, ), {, and } are metacharacters. That means that for them to be recognized as normal characters, they have to be escaped with a \ like this: \(, \), \{, and \}.
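A minimal Tcl sketch of the difference, using one line from the question (the variable names `line` and `inner` are made up for this example): the escaped `\(` and `\)` match the literal characters, while the unescaped pair forms the capture group.

```tcl
set line "    VHDL language standard: 3 (VHDL-2008)"

# \( and \) match literal parentheses in the input;
# the unescaped ( ... ) around [^)]+ is the capture group.
regexp {\(([^)]+)\)} $line -> inner
puts $inner
```

Writing the pattern in braces rather than double quotes keeps Tcl from processing the backslashes before the regexp engine sees them.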

On some operating systems, a new line is just \n, but on others, it is \r\n. A regex that will match both of those is \r?\n.
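For illustration, here is a small check with two made-up input strings, one with each line ending. It is only a sketch and assumes the text is held in a single string.

```tcl
set unix "ENTITY first\n    VHDL language standard: 3 (VHDL-2008)"
set dos  "ENTITY first\r\n    VHDL language standard: 3 (VHDL-2008)"

set results {}
foreach s [list $unix $dos] {
    # \r?\n accepts an optional carriage return before the line feed,
    # so the same pattern matches both line-ending styles.
    lappend results [regexp {^ENTITY first\r?\n} $s]
}
puts $results
```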

Try using this regex instead of "^ENTITY .*\\n(.*)":

ENTITY(?:.*\\r?\\n)*?.*\\((.*)\\)

You can find a demo and explanation here.

The Guy with The Hat