13

This feels like a really simple question, but I can't find the answer anywhere.

(Note: I'm using Python, but this shouldn't matter.)

Say I have the following string:

s = "foo\nbar\nfood\nfoo"

I am simply trying to find a regex that will match both instances of "foo", but not "food", based on the fact that the "foo" in "food" is not immediately followed by either a newline or the end of the string.

This is perhaps an overly complicated way to express my question, but it gives something concrete to work with.

Here are some of the things I have tried, with results (Note: the result I want is ['foo\n', 'foo']):

foo[\n\Z] => ['foo\n']

foo(\n|\Z) => ['\n', ''] <= This seems to match the newline and EOS, but not the foo

foo($|\n) => ['\n', '']

(foo)($|\n) => [('foo', '\n'), ('foo', '')] <= Almost there, and this is a usable plan B, but I would like to find the perfect solution.

The only thing I found that does work is:

foo$|foo\n => ['foo\n', 'foo']

This is fine for such a simple example, but it is easy to see how it could become unwieldy with a much larger expression (and yes, this foo thing is a stand in for the larger expression I am actually using).


Interesting aside: The closest SO question I could find to my problem was this one: In regex, match either the end of the string or a specific character

Here, I could simply substitute \n for my 'specific character'. The accepted answer there uses the regex /(&|\?)list=.*?(&|$)/. The OP was using JavaScript (the question carries the javascript tag), so maybe the JavaScript regex engine behaves differently, but when I use the exact strings given in that question with the above regex in Python, I get bad results:

>>> from re import findall
>>> findall(r"(&|\?)list=.*?(&|$)", "index.php?test=1&list=UL")
[('&', '')]
>>> findall(r"(&|\?)list=.*?(&|$)", "index.php?list=UL&more=1")
[('?', '&')]

So, I'm stumped.

Ken Bellows

3 Answers

12
>>> import re
>>> re.findall(r'foo(?:$|\n)', "foo\nbar\nfood\nfoo")
['foo\n', 'foo']

(?:...) makes a non-capturing group.

This works because of how re.findall treats groups (from the re module reference):

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
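To make the quoted behaviour concrete, here is a small sketch that runs the question's three group arrangements against the same string. The variable names are only for illustration; the patterns are the ones already shown above.

```python
import re

s = "foo\nbar\nfood\nfoo"

# One capturing group: findall returns only what that group captured.
capturing = re.findall(r'foo($|\n)', s)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'(foo)($|\n)', s)

# Non-capturing group: the pattern has no groups, so findall
# returns the whole match each time.
non_capturing = re.findall(r'foo(?:$|\n)', s)

print(capturing)      # ['\n', '']
print(two_groups)     # [('foo', '\n'), ('foo', '')]
print(non_capturing)  # ['foo\n', 'foo']
```

In all three cases the regex matches the same text; only what findall chooses to report differs.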

Phil Frost
  • 1
    Huh. Why is it that using a non-capturing group instead of a standard group works? Why doesn't plain old `r'foo($|\n)'` do the same thing? – Ken Bellows Dec 31 '12 at 17:01
  • Also, this is the one I wanted. Thanks very much! – Ken Bellows Dec 31 '12 at 17:03
  • 2
    If you have the `$|\n` in a normal group, you would match (and only match) the line breaks (as nothing else is in a capturing group). You could put the foo in a group as well, but then you would again end up with extra group results for the line breaks. – poke Dec 31 '12 at 17:04
  • 1
    @KenB: expanded the answer to address your question. – Phil Frost Dec 31 '12 at 17:06
  • Ahh. That definitely clears a few things up. Is this a Python-specific quirk with regards to handling groups, or would similar issues involving throwing away the non-grouped parts of the matches arise in Perl or Ruby or JavaScript, etc.? – Ken Bellows Dec 31 '12 at 17:12
  • 1
    @KenB: I don't know about Ruby, but it's similar in Perl. Try `@matches = "foo\nbar\nfood\nfoo" =~ /foo(?:$|\n)/g; print "@matches\n";`; but if you remove the `?:` to make it a matching group, you get similar results as you would in Python. – Phil Frost Dec 31 '12 at 17:24
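For the "any regular expression R" case raised in the comments, one possible sketch is to wrap R in a non-capturing group and read the whole match from re.finditer, so that groups inside R no longer change what comes back. The helper name find_before_line_end is hypothetical and not part of the re module, and this assumes R itself contains no global inline flags.

```python
import re

def find_before_line_end(inner, text):
    """Return full matches of `inner` followed by a newline or the end of the string.

    Hypothetical helper for illustration only.
    """
    # Wrap `inner` so that alternations inside it stay contained.
    pattern = re.compile(r'(?:' + inner + r')(?:$|\n)')
    # m.group(0) is always the whole match, whatever groups `inner` has.
    return [m.group(0) for m in pattern.finditer(text)]

s = "foo\nbar\nfood\nfoo"
plain = find_before_line_end(r'foo', s)
grouped = find_before_line_end(r'(f)(o+)', s)
print(plain)    # ['foo\n', 'foo']
print(grouped)  # ['foo\n', 'foo']

# The same idea applied to the regex from the question's aside:
url_pattern = re.compile(r'(&|\?)list=.*?(&|$)')
url_matches = [m.group(0) for m in url_pattern.finditer("index.php?test=1&list=UL")]
print(url_matches)  # ['&list=UL']
```

This suggests the "bad results" in the aside come from findall reporting the groups rather than from the pattern failing to match.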
5

You could use re.MULTILINE and include an optional linebreak after the $ in your pattern:

import re

s = "foo\nbar\nfood\nfoo"
# With re.MULTILINE, $ also matches just before each newline.
pattern = re.compile(r'foo$\n?', re.MULTILINE)
print(pattern.findall(s))
# -> ['foo\n', 'foo']
omz
  • I like it, but I would really prefer to find a language agnostic solution. Since `re.MULTILINE` is Python specific, I'd rather avoid it, for future use in other languages. – Ken Bellows Dec 31 '12 at 16:50
  • 3
    Most regular expression engines support a multiline option. You can also embed it directly in the pattern: `re.findall('(?m)foo$\n?', s)`. – omz Dec 31 '12 at 16:54
  • 1
    @KenB Exactly, flags like MULTILINE aren't Python specific, they just have different syntax in other languages (e.g. in Perl `re.MULTILINE` would be `$s =~ /blah/m` or something). I never realised the flags can be included in the pattern, that's really useful to know, thanks! :D – dbr Dec 31 '12 at 17:04
  • This is true, but for ease of portability's sake I was simply looking for a regex that could basically be copy/pasted between languages and work out of the box. The `(?m)` flag is definitely something to look into, but I think the non-capturing group was more what I was looking for. – Ken Bellows Dec 31 '12 at 17:07
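As a quick check of the inline-flag point from the comments, the sketch below compares the re.MULTILINE flag, the embedded `(?m)` form, and the same pattern with no flag at all. The variable names are illustrative only.

```python
import re

s = "foo\nbar\nfood\nfoo"

# Flag passed as an argument.
with_flag = re.findall(r'foo$\n?', s, re.MULTILINE)

# Same flag embedded at the start of the pattern.
inline = re.findall(r'(?m)foo$\n?', s)

# No flag: $ only matches at the very end of this string,
# so the first "foo" is missed.
without = re.findall(r'foo$\n?', s)

print(with_flag)  # ['foo\n', 'foo']
print(inline)     # ['foo\n', 'foo']
print(without)    # ['foo']
```

The embedded form keeps the whole behaviour inside the pattern string, which is what makes it easier to carry between engines that accept `(?m)`.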
1

If you're only concerned with foo:

In [42]: import re

In [43]: strs="foo\nbar\nfood\nfoo"

In [44]: re.findall(r'\bfoo\b',strs)
Out[44]: ['foo', 'foo']

\b denotes a word boundary:

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

(Source)
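The examples listed in the quoted documentation can be checked directly. The sketch below tests each sample string against r'\bfoo\b' and then shows one case where a word boundary is not the same thing as a line end; the sample list and variable names are illustrative.

```python
import re

word = re.compile(r'\bfoo\b')

# Sample strings taken from the documentation excerpt above.
samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
results = {text: bool(word.search(text)) for text in samples}
print(results)

# \b is satisfied by any non-word character, not only a newline,
# so the "foo" before the space is matched as well.
not_line_end = word.findall("foo bar\nfoo")
print(not_line_end)  # ['foo', 'foo']
```

The last result illustrates why `\b` alone may not be enough when the requirement is specifically "end of line or end of string".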

Gareth Latty
Ashwini Chaudhary
  • It might be worth explaining [`\b`](http://docs.python.org/2/library/re.html#regular-expression-syntax). (Edited in) – Gareth Latty Dec 31 '12 at 16:39
  • Again, `foo` is just a placeholder for a much more complicated expression. What I am really looking for is how to check against the end of the line or the end of the string. In many cases, using `\b` to check for word boundaries could break the expression. Good thought though. – Ken Bellows Dec 31 '12 at 16:41
  • 2
    @KenB Please give examples that actually show what you want - it's kind of hard to guess your requirements if you don't show them. – Gareth Latty Dec 31 '12 at 16:42
  • What I want is a generic solution, such that for any regular expression R, I could do something like `re.findall(R, str)` and have it work. My specific example really shouldn't matter for something so simple. I can give something more concrete if it's really necessary, but I don't think it is. – Ken Bellows Dec 31 '12 at 16:45
  • @KenB give us better examples, otherwise it's hard to guess this `R`. – Ashwini Chaudhary Dec 31 '12 at 16:49