-1

While writing a program to detect repeating patterns in binary, I came across a weird case where a regex does not seem to match properly in Python.

The regex is run as follows:

pattern = re.compile("^0b(1*)(0*)(\1\2)*(\1)?$")
result = pattern.match("0b101")

What I would expect to see is the following matching groups:

  • 1: '1'
  • 2: '0'
  • 3: empty
  • 4: '1'

But instead I get no match at all. According to the website regex101, the match should be as expected, but Python seems to disagree.

Is there a difference between the interpreters in Python and the website, or am I just missing some small mistake?

martijn p
  • Is this what you are after? https://stackoverflow.com/questions/5618988/regular-expression-parsing-a-binary-file – panoskarajohn Dec 03 '19 at 10:08
  • 3
    First off you're not escaping your backslashes... you might want to try with a raw-string, eg: `r"^0b(1*)(0*)(\1\2)*(\1)?$"` - which will then match your entire string, but then you still need to group accordingly – Jon Clements Dec 03 '19 at 10:08
  • 1
    The given input doesn't have a 3rd group, but has first, second and **fourth** group, because `\1\2` doesn't match, and the final `\1` does match (the 4th group). – Maroun Dec 03 '19 at 10:09
  • @JonClements oh man, you're absolutely correct! Seems like that fixed it. Don't know how I missed it haha. – martijn p Dec 03 '19 at 10:13

2 Answers

2

and the website

I'm assuming you created your regex using one of the websites like regex101, right?

If you look closely, regex101 hints that it uses raw strings.

In your case:

pattern = re.compile("^0b(1*)(0*)(\1\2)*(\1)?$")

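Python interprets \1 and \2 in a normal string literal as ordinary escape sequences, just like \n. Here they are octal escapes, so the regex engine receives the control characters \x01 and \x02 instead of backreferences.

What you need is a backslash that survives string parsing, so that the regex parser can see it.

This means either escaping the backslash (\\) or using a raw string, so that Python knows not to process \n and similar sequences.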

pattern = re.compile(r"^0b(1*)(0*)(\1\2)*(\1)?$")
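
To make the difference concrete, here is a minimal sketch comparing the three spellings of the pattern. The variable names (plain, raw, escaped) are only illustrative and not from the original post.

```python
import re

# Normal string literal: \1 and \2 are consumed by Python as octal escapes,
# so the pattern contains the control characters \x01 and \x02.
plain = "^0b(1*)(0*)(\1\2)*(\1)?$"

# Raw string literal: the backslashes are kept, so the regex engine
# receives real backreferences.
raw = r"^0b(1*)(0*)(\1\2)*(\1)?$"

# Doubling each backslash produces the same pattern text as the raw string.
escaped = "^0b(1*)(0*)(\\1\\2)*(\\1)?$"

print(repr(plain))                       # shows \x01 and \x02 in the pattern
print(raw == escaped)                    # both spellings are the same string
print(re.match(plain, "0b101"))          # no match
print(re.match(raw, "0b101").groups())   # groups as reported in the question
```

Printing repr() of a pattern before compiling it is a quick way to check what the regex engine will actually receive.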
h4z3
0

The regex ^0b(1*)(0*)(\1\2)*(\1)?$, applied to 0b101, matches the following groups:

  • group 1 - the first "1" in 0b101
  • group 2 - the "0" in 0b101
  • group 3 - no match, since "10" wasn't encountered
  • group 4 - the final "1" in 0b101 (successfully matches \1, which is a "1")

>>> import re
>>> pattern = re.compile(r"^0b(1*)(0*)(\1\2)*(\1)?$")
>>> result = pattern.match("0b101")
>>> result.groups()
('1', '0', None, '1')
Maroun
  • That's not the problem. The way the code is currently written, `.match` returns `None`. You haven't even executed the posted code. – h4z3 Dec 03 '19 at 10:13
  • Since group 3 is ended with a * symbol it should still result in a match however – martijn p Dec 03 '19 at 10:13
  • @h4z3 "*What I would expect to see is the following matching groups: 1: '1' 2: '0' 3: '1'*" - I was referencing that comment, referring to the "3" group. – Maroun Dec 03 '19 at 10:14
  • @martijnp No, it won't be matched, it'll be discarded. My answer is purely about the regex, not Python, and it was referring to your expectation that the third group would be matched, which is not correct because only groups 1, 2 and 4 are matched. – Maroun Dec 03 '19 at 10:15
  • It matches, it just places what I put as group 3 in group 4. I forgot to add that group 3 is empty and just left it out since it doesn't contain anything – martijn p Dec 03 '19 at 10:18
  • @martijnp Please see my edit........... – Maroun Dec 03 '19 at 10:19