5

How can I match r'\a' in Python using a lookbehind assertion?
Actually, I need to match C++ strings like "a \" b" and

"str begin \
end"

I tried:

>>> res = re.compile('(?<=\)a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression

>>> res = re.compile('(?<=\\)a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

>>> res = re.compile('(?<=\\\)a')
>>> ms = res.match(r'\a')
>>> ms is None
True

Real Example:
When I'm parsing "my s\"tr"; 5; with ms = res.match(r'"my s\"tr"; 5;'), the expected output is: "my s\"tr"

Answer
Finally, stribizhev provided the solution. I believe my initial regex is less computationally expensive, and its only issue was that it had to be declared as a raw string:

>>> res = re.compile(r'"([^\n"]|(?<=\\)["\n])*"', re.UNICODE)
>>> ms = res.match(r'"my s\"tr"; 5;')
>>> print ms.group()
"my s\"tr"
Wiktor Stribiżew
luart
  • Why are you looking behind `a`? What is the actual pattern you are trying to match? – thefourtheye Apr 29 '15 at 07:07
  • try with `\\\\ ` instead of `\\\ ` – Morb Apr 29 '15 at 07:09
  • To "thefourtheye": actual pattern is: res = re.compile('"([^\n"]|(?<=\\)["\n])*"') to match strings like: ms = res.match('"my s\"tr"; 5;') To "Morb": '\\\\' is parsed, but does not work as expected. – luart Apr 29 '15 at 07:16
  • Sorry Morb, you are right! re.compile('"([^\n"]|(?<=\\\\)["\n])*"') also solves the issue; I just didn't supply the string as raw when I tested it the first time – luart Apr 29 '15 at 08:30
  • @luart: Please consider changing the title to something like `Match C++-like quoted strings regex`. I was trying to find one and failed. – Wiktor Stribiżew Apr 29 '15 at 09:31
  • Done, thanks stribizhev. – luart Apr 29 '15 at 09:36

4 Answers

2

Assuming that the source code compiles, this is the classic solution for matching a regular string literal in C and C++, taking line continuation syntax into account:

(?s)"(?:[^"\\\n]|\\.)*"

In retrospect, since I already assume the source code compiles, there is no need to exclude stray newlines that are not part of line continuation syntax via [^"\\\n], so using only [^"\\] would also work.

The regex above matches all the following test cases correctly:

"a \" b"

"a \
 b"

"\\"

"\\\
kjsh\a\b\tdfkj\"\\\\\\"

"kjsdhfksd f\\\\"

"kjsdhfksd f\\\""

Demo on regex101
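
As a quick check of the claim above, here is a minimal sketch that runs the same pattern over the listed test cases. It assumes Python 3 (the thread itself uses Python 2.7), and the names STRING_LITERAL and cases are made up for illustration:

```python
import re

# Pattern quoted from the answer. (?s) lets "." match a newline, which is
# what allows backslash-newline (line continuation) inside a literal.
STRING_LITERAL = re.compile(r'(?s)"(?:[^"\\\n]|\\.)*"')

# The test cases from the answer. Raw strings keep the backslashes intact;
# the backslash-newline pairs are spliced in as ordinary strings.
cases = [
    r'"a \" b"',
    r'"a ' + '\\\n' + r' b"',
    r'"\\"',
    r'"\\' + '\\\n' + r'kjsh\a\b\tdfkj\"\\\\\\"',
    r'"kjsdhfksd f\\\\"',
    r'"kjsdhfksd f\\\""',
]

for case in cases:
    # fullmatch succeeds only if the whole case is one string literal.
    print(repr(case), bool(STRING_LITERAL.fullmatch(case)))

# With trailing source text, match() stops at the closing quote.
first = STRING_LITERAL.match(r'"my s\"tr"; 5;').group()
print(first)
```

Each case should report True, and the last line should print the literal without the trailing `; 5;`.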

stribizhev's old answer (?s)((?<!\\)".+?(?<!(?<!\\)\\)") fails to match the valid case "kjsdhfksd f\\\"", and adding more look-behinds only fixes the issue for a limited number of \.

The possibility of many consecutive \ in a string literal is the reason why such a regex doesn't work, and why we should not use a split operation to tokenize CSV with quoted fields.
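
To make the backslash-run problem concrete, the sketch below compares the simpler single-look-behind variant from the comment thread with the escape-pair pattern from this answer. It assumes Python 3, and the names LOOKBEHIND and ESCAPE_PAIRS are illustrative only:

```python
import re

# Single look-behind variant discussed in the comments (not the nested one).
LOOKBEHIND = re.compile(r'(?s)((?<!\\)".+?(?<!\\)")')
# Escape-pair pattern from this answer.
ESCAPE_PAIRS = re.compile(r'(?s)"(?:[^"\\\n]|\\.)*"')

# A literal holding one escaped backslash: the closing quote is preceded
# by a backslash, yet that backslash does not escape the quote.
one_backslash = r'"\\"'
print(LOOKBEHIND.search(one_backslash))            # None
print(ESCAPE_PAIRS.search(one_backslash).group())  # "\\"

# Two adjacent literals: the look-behind version runs past the first
# closing quote and stops at the opening quote of the second literal.
two_literals = r'"a\\" "b"'
print(LOOKBEHIND.search(two_literals).group())     # "a\\" "
print(ESCAPE_PAIRS.search(two_literals).group())   # "a\\"
```

A look-behind only inspects a fixed number of characters, so it cannot tell whether a run of backslashes before a quote has even or odd length. Consuming each backslash together with the character it escapes sidesteps that question.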

nhahtdh
  • Note that this solution hasn't been tested with raw string literal in C++, and I believe it requires modification to work with it. – nhahtdh Apr 29 '15 at 08:47
  • Thank you nhahtdh. Basically final solution from stribizhev is the refined yours! – luart Apr 29 '15 at 09:39
1

EDIT: The final regex is an adaptation from the regex provided at Word Aligned

I think you are looking for this regex:

(?s)"(?:[^"\\]|\\.)*"

See demo on regex101.

Sample Python code (tested on TutorialsPoint):

import re
p = re.compile(ur'(?s)"(?:[^"\\]|\\.)*"')
ms = p.match('"my s\\"tr"; 5;')
print ms.group(0)
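
The sample above is Python 2 code: the ur'' prefix and the print statement are both rejected by Python 3. A rough Python 3 equivalent, with an extra made-up input (source) to show findall picking up more than one literal, might look like this:

```python
import re

# Same pattern as above; in Python 3 a plain raw string replaces ur''.
p = re.compile(r'(?s)"(?:[^"\\]|\\.)*"')

# Hypothetical input with two string literals.
source = r'"my s\"tr"; 5; "second \\"'

# match() anchors at the start of the input, as in the original sample.
ms = p.match(source)
print(ms.group(0))  # "my s\"tr"

# findall() returns whole matches because the group is non-capturing.
literals = p.findall(source)
for lit in literals:
    print(lit)
```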
Wiktor Stribiżew
  • This is not an answer to this question at all! Please obtain `a` from `\a` if you can – Mazdak Apr 29 '15 at 07:23
  • @Kasra: *Actually, I need to match C++ strings like "a \" b" and "str begin \ end"*. – Wiktor Stribiżew Apr 29 '15 at 07:24
  • @luart: Do not use `match` since this method only looks for a match at the beginning of a string. Use `findall`. Or at least `search`. – Wiktor Stribiżew Apr 29 '15 at 07:30
  • To stribizhev: I'm always positioned on the begin of the string and need to extract just this first string from all the text, so the "match" fits fine for me – luart Apr 29 '15 at 07:41
  • @luart: Ok, it is clearer now. Just note that greedy matching suggested by Kasra (`"(.*)"`) might lead to overmatching in case there are more than 1 value inside double quotation marks. I am using look-behinds to make sure that does not happen. – Wiktor Stribiżew Apr 29 '15 at 07:45
  • To stribizhev: Thanks a lot, it works! re.compile(ur'(?s)((?<!\\)".+?(?<!\\)")') – luart Apr 29 '15 at 07:51
  • @luart: Great! I have added a Python code sample for you to check out my approach. I also believe you have a typo in the example you supplied (the `\\` must be doubled). – Wiktor Stribiżew Apr 29 '15 at 07:51
  • Note that `.+?` is always a bad idea to match string literal, and you simply reject the valid case of `"\\"` – nhahtdh Apr 29 '15 at 07:53
  • Thanks stribizhev, I also fixed the typo in the example – luart Apr 29 '15 at 07:56
  • @nhahtdh: This can be fixed by adding this marginal case as an alternative: `(?s)((?<!\\)".+?(?<!\\)"|"\\\\")`: https://regex101.com/r/yN0cN9/2 – Wiktor Stribiżew Apr 29 '15 at 07:58
  • @stribizhev: That is not a way to fix. It is only avoiding the issue. You only get away with it because it's rare in source file. – nhahtdh Apr 29 '15 at 08:03
  • @nhahtdh: I see your point. I modified the regex to make sure a double backslash is captured, too, with a look-behind inside a look-behind. If you know more cases like that, please share, or post your own answer. However, the current version seems already safe. No need to check for backslashes before the first double quotation mark, I think. – Wiktor Stribiżew Apr 29 '15 at 08:11
  • @stribizhev: Check my answer. I don't understand your notion of "safe". In these cases, at least you should be able to match any correct string literal from a compilable source code. This is a problem encountered and solved over and over again on SO, and if the question has mentioned the possibility of escape sequences, then the rigid answer is preferred over this answer. – nhahtdh Apr 29 '15 at 08:28
  • Ok, I have searched a bit on the Web, and found a similar solution to @nhahtdh's. I adapted it a bit. – Wiktor Stribiżew Apr 29 '15 at 08:58
  • @stribizhev: It's basically the same, yes. My regex disallows `\n` (which is not part of line continuation) to appear in the string, but in retrospect, it only offers a false sense of security, as I already assume that the code compiles. – nhahtdh Apr 29 '15 at 09:26
1

A better way: you can avoid repeating an alternation that matches only one character at a time if you "unroll" the pattern like this:

(?s)"[^"\\]*(?:\\.[^"\\]*)*"

Note that you don't need a lookbehind either.

As suggested by nhahtdh, if you want to ensure that the whole string is on one line, you only need to exclude \n from the character classes:

(?s)"[^"\\\n]*(?:\\.[^"\\\n]*)*"
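
The following sketch, assuming Python 3, checks that the unrolled pattern returns the same matches as the alternation form on a few sample inputs, and shows what the \n-excluding variant rejects. The names ALTERNATION, UNROLLED, ONE_LINE and samples are invented for the example:

```python
import re

ALTERNATION = re.compile(r'(?s)"(?:[^"\\]|\\.)*"')
UNROLLED = re.compile(r'(?s)"[^"\\]*(?:\\.[^"\\]*)*"')
ONE_LINE = re.compile(r'(?s)"[^"\\\n]*(?:\\.[^"\\\n]*)*"')

samples = [
    r'"my s\"tr"; 5;',
    r'"\\"',
    r'"kjsdhfksd f\\\""',
    'no quotes here',
]

def matched(pattern, text):
    """Return the text matched at the start of the input, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None

for s in samples:
    # Both forms should agree on every sample.
    print(repr(s), matched(ALTERNATION, s) == matched(UNROLLED, s))

# A bare newline inside the quotes is rejected by the \n-excluding variant,
# while a backslash-newline (line continuation) is still accepted.
print(matched(ONE_LINE, '"a\nb"'))
print(matched(ONE_LINE, '"a \\\n b"') is not None)
```

The unrolled form consumes runs of ordinary characters in a single [^"\\]* step instead of re-entering the alternation once per character, which is where the savings come from.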
Casimir et Hippolyte
0

Since \ is the escape character, you need to write \\ (escaping it once) within your string too, because Python interprets \a as the ASCII bell character (0x07):

>>> '\a'
'\x07'

Also, you must use re.search, because re.match checks for a match only at the beginning of the string:

>>> re.search(r'(?<=\\)a','\\a')
<_sre.SRE_Match object at 0x7fb704dd0370>
>>> re.search(r'(?<=\\)a','\\a').group(0)
'a'
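
Both points can be seen side by side in the small sketch below (Python 3 assumed; the variable names are illustrative). It shows that '\a' is one character while r'\a' is two, and that the same look-behind pattern fails with match but succeeds with search:

```python
import re

# Without the r prefix, \a is the single BEL character.
bell = '\a'
print(len(bell), bell == '\x07')   # 1 True

# With the r prefix, the text is a backslash followed by the letter a.
text = r'\a'
print(len(text))                   # 2

# Regex: an "a" that is preceded by one literal backslash.
pattern = re.compile(r'(?<=\\)a')

# match() only tries index 0, where the character is the backslash.
print(pattern.match(text))             # None

# search() scans forward and finds the "a" at index 1.
print(pattern.search(text).group(0))   # a
```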

But for your last example you don't need lookaround at all; you can use a simple group:

>>> re.search(r'"(.*)"','"my s\"tr"; 5;').group(0)
'"my s"tr"'
Mazdak