0

I am new to regex and could use some help. Blocks are separated by two newline characters (\n\n). I need to get the number of dogs, but only if the block contains a medium-sized dog.

I have the string

"4211 dogs ate 2 pounds
    chris (large)

3454 dogs ate 8 pounds
    john (medium)
    alex (small)

4211 dogs ate 2 pounds
    morgan (small)
"

Using this regex:
\d+(?=\sdogs\sate\s\d+\spounds[\s\S]*(?!\n\n)\(medium\))
almost works. The problem is that it does not stop at the first occurrence of \n\n; it keeps going until the last one. I need it to stop at the first \n\n so that it cannot find patterns in other blocks.
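
To reproduce the behaviour described above, here is a minimal sketch. It assumes Python's re module (the question does not name a language), and the variable names are only illustrative. The first block has no medium dog, yet its number is still returned because [\s\S]* runs across the blank line.

```python
import re

# Sample input from the question: three blocks separated by a blank line.
text = (
    "4211 dogs ate 2 pounds\n"
    "    chris (large)\n"
    "\n"
    "3454 dogs ate 8 pounds\n"
    "    john (medium)\n"
    "    alex (small)\n"
    "\n"
    "4211 dogs ate 2 pounds\n"
    "    morgan (small)\n"
)

# The regex from the question, unchanged.
pattern = r"\d+(?=\sdogs\sate\s\d+\spounds[\s\S]*(?!\n\n)\(medium\))"

original_matches = re.findall(pattern, text)
print(original_matches)  # ['4211', '3454'] -> the first entry is unwanted
```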

Yunnosch

2 Answers

1

You could use

^                 # match the start of the line in multiline mode
(?P<amount>\d+)   # capture the number of dogs
(?:(?!^$)[\s\S])+ # do not overrun an empty line, matching every character
\(medium\)        # look for (medium)

See a demo on regex101.com (and mind the modifiers: the pattern needs the multiline and verbose/extended flags).
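
As a rough illustration, the pattern above could be used from Python like this. The choice of Python and the variable names are assumptions, not part of the original answer; re.MULTILINE makes ^ and $ match at line boundaries and re.VERBOSE allows the inline comments.

```python
import re

# Sample input from the question: three blocks separated by a blank line.
text = (
    "4211 dogs ate 2 pounds\n"
    "    chris (large)\n"
    "\n"
    "3454 dogs ate 8 pounds\n"
    "    john (medium)\n"
    "    alex (small)\n"
    "\n"
    "4211 dogs ate 2 pounds\n"
    "    morgan (small)\n"
)

pattern = re.compile(
    r"""
    ^                  # start of a line (requires re.MULTILINE)
    (?P<amount>\d+)    # capture the number of dogs
    (?:(?!^$)[\s\S])+  # any character, but never step onto an empty line
    \(medium\)         # the literal text (medium)
    """,
    re.MULTILINE | re.VERBOSE,
)

amounts = [m.group("amount") for m in pattern.finditer(text)]
print(amounts)  # ['3454']
```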


An alternative solution would be to split on empty lines (^$ with the multiline flag set) and check for (medium) in the resulting blocks.
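
A minimal sketch of that split-based approach, assuming Python. For simplicity it splits on the literal "\n\n" separator described in the question rather than on ^$, and the names used are illustrative only.

```python
import re

# Sample input from the question: three blocks separated by a blank line.
text = (
    "4211 dogs ate 2 pounds\n"
    "    chris (large)\n"
    "\n"
    "3454 dogs ate 8 pounds\n"
    "    john (medium)\n"
    "    alex (small)\n"
    "\n"
    "4211 dogs ate 2 pounds\n"
    "    morgan (small)\n"
)

amounts = []
for block in text.strip().split("\n\n"):
    # Only keep blocks that mention a medium-sized dog.
    if "(medium)" in block:
        # The number of dogs is the run of digits at the start of the block.
        m = re.match(r"\d+", block)
        if m:
            amounts.append(m.group())

print(amounts)  # ['3454']
```

This trades a single regex for two simple steps, which can be easier to read and debug when the block format is stable.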
Jan
1

PCRE with a capture group:

(?m)^(\d+) dogs ate \d+ pounds\n(?>.+\n)*?.*\(medium\)

without:

(?m)^\d+(?= dogs ate \d+ pounds\n(?>.+\n)*?.*\(medium\))

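JavaScript/Python with a capture group (JavaScript has no inline (?m) modifier, so drop it there and use the m flag instead, as in /.../m):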

(?m)^(\d+) dogs ate \d+ pounds\n(?:.+\n)*?.*\(medium\)

without:

(?m)^\d+(?= dogs ate \d+ pounds\n(?:.+\n)*?.*\(medium\))

The key to these patterns is that every line that may come before (medium) is described with .+, which requires at least one character (in other words, the line cannot be blank), so the match can never cross the blank line that separates blocks.
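
For illustration, here is how the capture-group variant might be run in Python. This is a sketch with assumed variable names, using the sample string from the question.

```python
import re

# Sample input from the question: three blocks separated by a blank line.
text = (
    "4211 dogs ate 2 pounds\n"
    "    chris (large)\n"
    "\n"
    "3454 dogs ate 8 pounds\n"
    "    john (medium)\n"
    "    alex (small)\n"
    "\n"
    "4211 dogs ate 2 pounds\n"
    "    morgan (small)\n"
)

# (?:.+\n)*? lazily skips non-blank lines; a blank line cannot match .+,
# so the search for (medium) stays inside the current block.
pattern = r"(?m)^(\d+) dogs ate \d+ pounds\n(?:.+\n)*?.*\(medium\)"

# With a single capture group, findall returns the captured strings.
amounts = re.findall(pattern, text)
print(amounts)  # ['3454']
```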

Casimir et Hippolyte
  • I like this one as well (+1) as it requires less steps than mine. – Jan Dec 19 '17 at 20:33
  • @Jan: Thanks, note that anchoring the pattern *(in particular before a non-literal)* makes a huge difference. – Casimir et Hippolyte Dec 19 '17 at 20:36
  • As you see, I did anchor my expression as well - the difference lies in the tempered greedy token which is of course more "expensive" than `.+`. – Jan Dec 19 '17 at 20:37
  • I didn't see. Indeed `.+` is faster. To emulate the atomic group in Javascript and Python, you can also rewrite `(?>.+\r?\n)*?` as `(?:(?=(.+\r?\n))\2)*?`. – Casimir et Hippolyte Dec 19 '17 at 20:41
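
The atomic-group emulation from the last comment can be sketched in Python as follows. This assumes the capture-group variant of the pattern, so the lookahead's group is number 2, and the variable names are illustrative. The lookahead captures a whole non-blank line and the backreference \2 then consumes it in one piece, so the engine does not backtrack into the line.

```python
import re

# Sample input from the question: three blocks separated by a blank line.
text = (
    "4211 dogs ate 2 pounds\n"
    "    chris (large)\n"
    "\n"
    "3454 dogs ate 8 pounds\n"
    "    john (medium)\n"
    "    alex (small)\n"
    "\n"
    "4211 dogs ate 2 pounds\n"
    "    morgan (small)\n"
)

# Group 1: number of dogs. Group 2: the line captured inside the lookahead.
pattern = re.compile(
    r"(?m)^(\d+) dogs ate \d+ pounds\r?\n(?:(?=(.+\r?\n))\2)*?.*\(medium\)"
)

# finditer is used because findall would return (group 1, group 2) tuples.
amounts = [m.group(1) for m in pattern.finditer(text)]
print(amounts)  # ['3454']
```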