4

I am trying to parse the following string and return all digits after the last square bracket:

C9: Title of object (foo, bar) [ch1, CH12,c03,4]

So the result should be:

1,12,03,4

The string and digits will change. The important thing is to get the digits after the last '[' regardless of what characters (if any) precede it. (I need this in Python, so no atomic groups either!) I have tried everything I can think of, including:

 \[.*?(\d) = matches '1' only
 \[.*(\d) = matches '4' only
 \[*?(\d) = matches every digit, including the '9' from the beginning

etc
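
For reference, here is a minimal sketch of what each of the three attempts returns with `re.findall` on the sample string. The variable names are illustrative only and are not part of the original question.

```python
import re

s = 'C9: Title of object (foo, bar) [ch1, CH12,c03,4]'

# Lazy .*? stops at the first digit after '['. Nothing later in the
# string starts with '[' again, so findall yields a single match.
lazy = re.findall(r'\[.*?(\d)', s)

# Greedy .* runs to the end of the string and backtracks to the last digit.
greedy = re.findall(r'\[.*(\d)', s)

# \[*? is allowed to match zero brackets, so every single digit matches.
optional_bracket = re.findall(r'\[*?(\d)', s)

print(lazy)              # ['1']
print(greedy)            # ['4']
print(optional_bracket)  # ['9', '1', '1', '2', '0', '3', '4']
```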

Any help is greatly appreciated!

EDIT: I also need to do this without using str.split().

Keith Burke
  • Is the format of the string always the same? Do you _need_ a regexp? – Vincent Savard Dec 17 '15 at 15:45
  • No, the format will change each time. The only thing that remains the same is that there will be an ID (C9 here), then some text (which may include '['), then finally a set of characters I need. The characters __should__ have a CH, but I have found that these are left out a lot, or 0 is used instead of O. – Keith Burke Dec 17 '15 at 15:53

2 Answers

5

You can instead find all digits in the substring after the last [ bracket:

>>> import re
>>> s = 'C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'
>>> # Get substring after the last '['.
>>> target_string = s.rsplit('[', 1)[1]
>>>
>>> re.findall(r'\d+', target_string)
['1', '12', '03', '4']
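
One caveat with the snippet above: `rsplit('[', 1)[1]` raises an `IndexError` when the string contains no `[` at all. A minimal sketch that guards against that case with `str.rfind`, which also avoids `split` entirely, might look like this. The function name `digits_after_last_bracket` and the choice to return an empty list are assumptions made for illustration.

```python
import re

def digits_after_last_bracket(s):
    """Return the digit runs found after the last '[' in s (hypothetical helper)."""
    idx = s.rfind('[')  # rfind returns -1 when there is no '[' at all
    if idx == -1:
        return []       # assumption: no bracket means nothing to extract
    return re.findall(r'\d+', s[idx + 1:])

print(digits_after_last_bracket('C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'))
# ['1', '12', '03', '4']
print(digits_after_last_bracket('C9: no brackets here'))
# []
```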

If you can't use split, then this one works with a look-ahead assertion:

>>> s = 'C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'
>>> re.findall(r'\d+(?=[^[]+$)', s)
['1', '12', '03', '4']

This finds every run of digits that is followed only by non-[ characters up to the end of the string.
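
To see what the `$` anchor contributes, here is a small sketch comparing the pattern with and without it. The bracket inside the class is escaped here (`[^\[]`), which is the same class as `[^[]`; the variable names are illustrative.

```python
import re

s = 'C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'

# With $, the look-ahead must reach the end of the string without
# crossing a '[', so only digits after the last '[' qualify.
anchored = re.compile(r'\d+(?=[^\[]+$)')

# Without $, one following non-'[' character is enough, so every
# run of digits in the string qualifies.
unanchored = re.compile(r'\d+(?=[^\[]+)')

with_anchor = anchored.findall(s)
without_anchor = unanchored.findall(s)

print(with_anchor)     # ['1', '12', '03', '4']
print(without_anchor)  # ['9', '123', '1', '12', '03', '4']
```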

Rohit Jain
  • Although it isn't in 1 expression, I think this is possibly the cleanest solution. Thank you! – Keith Burke Dec 17 '15 at 15:50
  • @KeithBurke Got a single expression too. :) – Rohit Jain Dec 17 '15 at 15:50
  • Hey @KeithBurke I don't understand how your second solution works. I don't know why it doesn't capture the 9 from C9 and the 123. This part (?=[^[]+$) confuses me a lot: how do you make sure you are inside the [], and how do you capture only the numbers? Could you explain it? – aDoN Dec 17 '15 at 16:06
  • @aDoN: This part: `(?=[^[]+$)` of regex is a look-ahead. `[^[]` inside this is a negated character class - matching all non-`[` characters. We don't need to escape `[` inside there. Now the whole regex means - *match all `\d+` pattern, which is followed by non-`[` characters till the end*. Note the `$` in the regex is important, else it will match all the digits. – Rohit Jain Dec 17 '15 at 16:11
  • So, for `C9`, when `\d+` matches the `9` and the regex engine starts matching the lookahead, it fails at the second-to-last `[`, since that doesn't match the negated character class. Similarly, `123` fails at the last `[`. – Rohit Jain Dec 17 '15 at 16:12
  • Ahhhhh!! I thought it had to be immediately followed by an `[`, that's cool. I still don't understand the `[^[]` part well: why couldn't it be `?=^\]` instead? And why the parentheses? Thank you very much – aDoN Dec 17 '15 at 16:17
  • @aDoN The parentheses are required, because they are what make `?=` a look-ahead. That's regex syntax. Again, `[^[]` is required because `^` negates something only inside a character class. Outside it means start of string. – Rohit Jain Dec 17 '15 at 16:25
  • Ohhh Yeah you are right, thank you very much the `[^[]` is necessary, so now, the only thing I don't understand is the `+`. If the expression was `re.findall(r'\d+(?=[^[].+$)', s)` with a `.`, it matches every number. So... Why `[^[]+$`. Thank you , your explanations are very clear. – aDoN Dec 17 '15 at 16:38
  • @aDoN Because a `.` will also match the `[`. Your pattern means: any digits that are followed by a single non-`[` character, and then any characters up to the end. See that every number matches this rule. – Rohit Jain Dec 17 '15 at 17:00
  • ahhh @RohitJain, yeah, you are right, so it's like `^[+` (everything that is not a `[`). I was thinking, why doesn't a negative lookahead work here? I mean `re.findall(r'\d+(?!\[+$)', s)`. My intuition is: it finds the first number, the `9`, it looks ahead, it finds the `[`, and the `+` after the `[` must work only for non-`[` characters, so the `9` and the `123` shouldn't match, I think. But what happens is that it matches everything. Thank you very much – aDoN Dec 18 '15 at 10:00
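
On the negative look-ahead question in the last comment, here is a short sketch (not from the answer's author) of why `(?!\[+$)` rejects nothing in this string, together with one negative look-ahead that appears to give the same result as the pattern in the answer. The variable names are illustrative.

```python
import re

s = 'C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'

# (?!\[+$) only rejects digits that are directly followed by one or more
# '[' and then the end of the string. No digit in s is in that position,
# so every run of digits passes.
attempt = re.findall(r'\d+(?!\[+$)', s)

# Instead, reject digits that still have a '[' somewhere to their right.
fixed = re.findall(r'\d+(?![^\[]*\[)', s)

print(attempt)  # ['9', '123', '1', '12', '03', '4']
print(fixed)    # ['1', '12', '03', '4']
```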
-1

It may help to use the non-greedy ?. For example:

\[.*?(\d*?),.*?(\d*?),.*?(\d*?),.*?(\d*?)\]

And, here's how it works (from https://regex101.com/r/jP7hM3/1):

"\[.*?(\d*?),.*?(\d*?),.*?(\d*?),.*?(\d*?)\]"
  \[ matches the character [ literally
  .*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
  1st Capturing group (\d*?)
    \d*? match a digit [0-9]
      Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
  , matches the character , literally
  .*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
  2nd Capturing group (\d*?)
    \d*? match a digit [0-9]
      Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
  , matches the character , literally
  .*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
  3rd Capturing group (\d*?)
    \d*? match a digit [0-9]
      Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
  , matches the character , literally
  .*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
  4th Capturing group (\d*?)
    \d*? match a digit [0-9]
      Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
  \] matches the character ] literally

Although I have to agree with the others: this is a regex solution, but it's not a very Pythonic one.
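
As a rough illustration of how this pattern behaves in Python, the sketch below runs it with `re.search` on the question's sample string and on a hypothetical variant with only three entries. Because the pattern hard-codes four comma-separated groups, it only matches when the bracket holds exactly that shape.

```python
import re

pattern = re.compile(r'\[.*?(\d*?),.*?(\d*?),.*?(\d*?),.*?(\d*?)\]')

# The sample string from the question: four entries inside the brackets.
four = pattern.search('C9: Title of object (foo, bar) [ch1, CH12,c03,4]')

# A made-up variant with three entries: the pattern needs three commas
# after the '[', so there is no match.
three = pattern.search('C9: Title of object (foo, bar) [ch1, CH12,c03]')

print(four.groups())  # ('1', '12', '03', '4')
print(three)          # None
```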

Brian