I'm writing a small helper script to analyse C code, especially the use of structs. I have problems detecting when a struct is used as a value as opposed to a pointer. That means I want to detect if the text struct foo is followed by an arbitrary amount of whitespace and a character that is not *.

I boiled my problem down to this MWE:

>>> import re
>>> there = re.compile('struct foo(\\s*)[^*]')
>>> match = there.search('struct foo *bar')

Note. I need to use the double backslash because I cannot use raw strings in my application. I actually need an f-string.

In my book, the MWE should not produce a match. However, it does, and if I look at match.groups(), I get

>>> match.groups()
('',)

meaning that \\s* did match zero whitespace characters. From the documentation I would have expected it to match the single space before *bar in my string, as the * quantifier should match zero or more characters greedily.

Exchanging \\s with [ \t] or even using ( *) (note the space) as the capture group does not make a difference either.

Why does \\s* seem to match zero characters in presence of a space?

arne
  • Try removing the first slash in your capture group and see if that helps. – whege Oct 06 '20 at 15:13
  • @LiamFiddler: No change. I also tried to replace `\\s` with `[ \t]` as noted without any difference. Even using only `( *)` as the capture group does not work. – arne Oct 06 '20 at 15:21

3 Answers

I think you just want to make sure that the final character group doesn't match space characters. So you want:

struct foo(\\s*)[^*\\s]
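
As a quick check, here is a minimal sketch of that pattern in an f-string, since the question says raw strings are not an option. The `name` variable and the use of `re.escape` are assumptions for illustration, not part of the original code.

```python
import re

# Hypothetical struct name, interpolated with an f-string as the question requires.
name = 'foo'

# Excluding whitespace from the final class stops it from taking the space
# that (\\s*) gives up during backtracking.
fixed = re.compile(f'struct {re.escape(name)}(\\s*)[^*\\s]')

pointer_use = fixed.search('struct foo *bar')  # pointer: no match expected
value_use = fixed.search('struct foo bar')     # value: match expected

print(pointer_use)           # None
print(value_use.group(0))    # struct foo b
```

Note that this pattern would still match `struct foobar`, because nothing requires the name to end after `foo`; adding `\\b` after the name is one possible refinement.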
CryptoFool

I would use this regular expression:

(?:struct foo\s*)([^*\s]+)

This captures what comes after the whitespace, provided it does not start with an asterisk.

Example: struct foo *bar would return nothing.
struct foo bar would return bar.

Test and explanation here: https://regex101.com/r/dVeHc3/1
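
The two examples above can be reproduced with a short sketch. The helper name `name_after` is made up for illustration, and the pattern is written with doubled backslashes because the question rules out raw strings.

```python
import re

# Same expression as above, written without a raw string.
after_struct = re.compile('(?:struct foo\\s*)([^*\\s]+)')

def name_after(text):
    """Return the token following 'struct foo', or None if there is no match."""
    m = after_struct.search(text)
    return m.group(1) if m else None

print(name_after('struct foo *bar'))  # None
print(name_after('struct foo bar'))   # bar
```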

Valentin Grégoire
  • This works, but I still don't know why my regex, specifically `\s*`, doesn't. – arne Oct 06 '20 at 15:26
  • That is because your regex allows zero or more whitespace characters after `struct foo`. `\s*` first takes the space, but then `[^*]` cannot match the `*`, so the engine backtracks: `\s*` gives the space back and matches an empty string, and `[^*]` matches the space instead. At least, that would be my assumption. – Valentin Grégoire Oct 07 '20 at 13:05

(\\s*) is correctly matching zero spaces. The [^*] can't match against the * in the text, so the engine backtracks and [^*] matches the previous character instead: the single space that (\\s*) would otherwise have matched.
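
One way to see this is to look at the whole match rather than only the capture group. A small sketch using the pattern and input from the question:

```python
import re

# Pattern and input exactly as in the question (no raw string).
pattern = re.compile('struct foo(\\s*)[^*]')
match = pattern.search('struct foo *bar')

# group(0) is the full match; it ends with the space, which [^*] consumed.
whole = match.group(0)
# group(1) is what (\\s*) kept after backtracking: nothing.
captured = match.group(1)

print(repr(whole))     # 'struct foo '
print(repr(captured))  # ''
```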

Bill Lynch
  • But shouldn't match evaluation proceed from left to right, i.e. the `(\\s*)` should consume as many spaces as possible (greedy) before evaluation of the next token in the regex starts? – arne Oct 06 '20 at 15:34
  • A greedy match will match as much as possible, but it will give characters back when that is needed for the overall pattern to match. If it didn't do that, then the regex `.*a` would never match anything. – Bill Lynch Oct 06 '20 at 15:36
  • Sounds like you want your expression to be `struct foo(\\s*)[^*\\s]` – CryptoFool Oct 06 '20 at 15:38
  • @Steve: add that as an answer or edit into this answer and I'll accept. – arne Oct 06 '20 at 15:47
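
The `.*a` remark in the comments above can be checked with a short sketch; the sample text is made up for illustration.

```python
import re

# Hypothetical sample text; the last 'a' is at the end of 'banana'.
text = 'banana split'

# .* first grabs the whole string, then gives characters back
# one at a time until the trailing 'a' can match.
greedy = re.search('.*a', text)

print(greedy.group(0))  # banana
```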