
I'm trying to write a regexp that catches assignment equal signs within conditional statements in C, which I have already extracted (using the Python re module).

My attempt:

exp = re.compile(r'\(\s*[0-9A-Za-z_]+\s*[^!<>=]=[^=]')

While working for a number of cases, it fails to match a simple case like the following string:

'(c=getc(pp)) == EOF'

Can someone please explain why my regexp does not match this string, and how I could improve it? I'm aware that some weird cases might still elude me, but I can handle those manually; the purpose is to do the bulk of the legwork automatically.

Wiktor Stribiżew
Valentin B.

2 Answers


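[^!<>=] following your identifier prevents = from being matched right after c. That character class has to consume exactly one character that is not !, <, > or =, and since the identifier here is the single character c, nothing is left between it and the = for the class to match.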
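To illustrate the point, here is a small sketch that runs the pattern from the question against three variants of the input. The helper name matched_text and the two extra test strings are only for illustration and are not part of the original post.

```python
import re

# Pattern from the question, unchanged.
original = re.compile(r'\(\s*[0-9A-Za-z_]+\s*[^!<>=]=[^=]')

def matched_text(text):
    """Return the matched substring, or None when the pattern does not match."""
    m = original.search(text)
    return m.group(0) if m else None

# One-character name directly before '=': nothing left for [^!<>=] to consume.
print(matched_text('(c=getc(pp)) == EOF'))
# Two-character name: the identifier part gives up 'h' to [^!<>=].
print(matched_text('(ch=getc(pp)) == EOF'))
# Space before '=': the space is consumed by [^!<>=] instead of \s*.
print(matched_text('(c =getc(pp)) == EOF'))
```

The first call prints None, while the other two print '(ch=g' and '(c =g', which is probably why the pattern appeared to work for a number of cases.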

If your intention is to match assignments, try to match only one equal sign after the identifier, like this:

import re

exp = re.compile(r'\(\s*[0-9A-Za-z_]+\s*=[^=]')

print(exp.search('(c=getc(pp)) == EOF'))

which results in:

<_sre.SRE_Match object; span=(0, 4), match='(c=g'>
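As a quick sanity check, the sketch below runs the same pattern over a few assignments and comparisons. The sample strings are made up for illustration; only the pattern comes from the answer.

```python
import re

# Pattern from this answer: exactly one '=' after the identifier.
fixed = re.compile(r'\(\s*[0-9A-Za-z_]+\s*=[^=]')

samples = [
    '(c=getc(pp)) == EOF',    # assignment, no spaces
    '(c = getc(pp)) == EOF',  # assignment, with spaces
    '(a == b)',               # comparisons from here on
    '(a <= b)',
    '(a != b)',
    '(a >= b)',
]

for s in samples:
    # True means the pattern found an assignment-like '=' in the string.
    print(s, '->', bool(fixed.search(s)))
```

Only the first two samples are reported as matches. Note that the pattern still requires the identifier to come directly after an opening parenthesis, so assignments in other positions are not covered.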
Jean-François Fabre

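The reason this does not work is [^!<>=]=, which makes your pattern look for one character that is not !, <, > or =, followed by a =. In '(c=getc(pp)) == EOF' the only character before the = is c, and it is already needed by the identifier part, so the match fails. I can see your intention in doing so, but it's the wrong way.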

For the simple case, have a look at the following expression:

[0-9A-Za-z_]+\s*=\s*[0-9A-Za-z_]+(\(\s*[0-9A-Za-z_]*\s*\))?

This matches the c=getc(pp) part of your source, because it looks for a = that is preceded and followed by optional whitespace and then letters, digits or underscores. This alone prevents the regex from matching ==, <=, !=, or >=.

Apart from that, it also checks whether the right-hand side is a function call, a variable, or just a number (the parenthesized part at the end of the expression is made optional by the ?). Note also the * on the character class inside the parentheses ([0-9A-Za-z_]*), which lets you match function calls without parameters.

Works for:

(c=getc(p)) == EOF
(c =getc()) == EOF
(c=getc( )) == EOF
(c = getc( p )) == EOF
(c = i) == EOF
(c=10) == EOF

This will not fully match constructs with nested calls, such as x = y(z()), where only the x = y part is matched (and surely many more).
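The cases above can be checked with a short script. This is only a sketch: the helper name find_assignment and the comparison strings at the end are illustrative additions, while the pattern and the other inputs come from this answer.

```python
import re

# Pattern from this answer.
pattern = re.compile(
    r'[0-9A-Za-z_]+\s*=\s*[0-9A-Za-z_]+(\(\s*[0-9A-Za-z_]*\s*\))?'
)

def find_assignment(text):
    """Return the first assignment-like substring, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

cases = [
    '(c=getc(p)) == EOF',
    '(c =getc()) == EOF',
    '(c=getc( )) == EOF',
    '(c = getc( p )) == EOF',
    '(c = i) == EOF',
    '(c=10) == EOF',
    'x = y(z())',   # nested call: only the 'x = y' prefix is matched
    'a == b',       # comparisons: no match expected
    'a <= b',
]

for text in cases:
    print(repr(text), '->', find_assignment(text))
```

For the nested call the result is the truncated 'x = y', and the two comparisons give None, so the pattern detects that an assignment is present even when it cannot capture the whole right-hand side.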

That said, I recommend the following link (not exactly your question, but it offers really nice insights): Regular expression to recognize variable declarations in C

ohmmega
  • Thank you for your complementary insight! – Valentin B. Oct 11 '17 at 13:17
  • You're welcome. In the case that you put more effort into your expression, let us know your solution. – ohmmega Oct 12 '17 at 05:59
  • Besides adding `&\->*` to the set corresponding to the variable name, to take structures, pointers and addresses into account, I'm afraid my regexp knowledge is very basic, so I don't see what else could be done! – Valentin B. Oct 12 '17 at 12:12