Regex for getting all digits in a string after a character

Question

I am trying to parse the following string and return all digits after the last square bracket:

C9: Title of object (foo, bar) [ch1, CH12,c03,4]

So the result should be:

1,12,03,4

The string and digits will change. The important thing is to get the digits after the last '[' regardless of what characters (if any) precede it. (I need this in Python, so no atomic groups either!) I have tried everything I can think of, including:

 \[.*?(\d) = matches '1' only
 \[.*(\d) = matches '4' only
 \[*?(\d) = matches also include the '9' from the beginning

etc

Any help is greatly appreciated!

EDIT: I also need to do this without using str.split() too.

Is the format of the string always the same? Do you _need_ a regexp? – Vincent Savard Dec 17 '15 at 15:45
No, the format will change each time. The only thing that remains the same is that there will be an ID (C9 here), then some text (which may include '['), then finally the set of characters I need. The characters __should__ have a CH, but I have found that it is often left out, or that 0 is used instead of O. – Keith Burke Dec 17 '15 at 15:53

Rohit Jain · Answer 1 · 2015-12-17T15:50:09.983

You can instead find all digits in the substring after the last [ bracket:

>>> import re
>>> s = 'C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'
>>> # Get substring after the last '['.
>>> target_string = s.rsplit('[', 1)[1]
>>>
>>> re.findall(r'\d+', target_string)
['1', '12', '03', '4']
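
If the restriction is only on str.split(), a similar effect can be sketched with str.rfind() and the optional start position that compiled patterns accept in findall(). The helper name `digits_after_last` and the choice to return an empty list when no bracket is present are assumptions for illustration, not part of the original answer.

```python
import re

DIGITS = re.compile(r'\d+')

def digits_after_last(s, marker='['):
    """Return all digit runs found after the last occurrence of marker."""
    idx = s.rfind(marker)  # -1 when the marker is absent
    if idx == -1:
        # Assumed behaviour: no bracket means nothing to extract.
        return []
    # Pattern.findall accepts a start position, so no substring is built.
    return DIGITS.findall(s, idx + 1)

s = 'C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'
print(digits_after_last(s))  # ['1', '12', '03', '4']
```

This keeps the regex trivial and leaves the "where to start" decision to plain string code.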

If you can't use split, then this one works using a look-ahead assertion:

>>> s = 'C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'
>>> re.findall(r'\d+(?=[^[]+$)', s)
['1', '12', '03', '4']

This finds every run of digits that is followed only by non-[ characters up to the end of the string.
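
A minimal sketch that wraps the look-ahead idea in a helper. The name `digits_after_last_bracket` is hypothetical, and the pattern is a slight variation on the one above (both changes are assumptions beyond the original answer): the `[` inside the class is escaped as `[^\[]`, since newer Python versions may emit a FutureWarning about a possible nested set for an unescaped `[` there, and `*` is used instead of `+` so that a number sitting at the very end of the string is matched whole.

```python
import re

# Digit runs followed only by non-'[' characters up to the end of the string.
# '[' is escaped inside the class and '*' replaces '+' (assumed variations).
AFTER_LAST_BRACKET = re.compile(r'\d+(?=[^\[]*$)')

def digits_after_last_bracket(s):
    """Return every digit run located after the last '[' in s."""
    return AFTER_LAST_BRACKET.findall(s)

s = 'C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'
print(digits_after_last_bracket(s))  # ['1', '12', '03', '4']
```

Note that if the string contains no `[` at all, every number in it satisfies the look-ahead, so the caller may want to check for the bracket first.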

edited Dec 17 '15 at 15:50

answered Dec 17 '15 at 15:45


Although it isn't in 1 expression, I think this is possibly the cleanest solution. Thank you! – Keith Burke Dec 17 '15 at 15:50
@KeithBurke Got a single expression too. :) – Rohit Jain Dec 17 '15 at 15:50
Hey @KeithBurke I don't understand how your second solution works. I don't know why it doesn't capture the 9 from C9 and the 123. This part (?=[^[]+$) confuses me a lot: how do you make sure you are inside the [], and how do you capture only the numbers? Could you explain it? – aDoN Dec 17 '15 at 16:06
@aDoN: This part: `(?=[^[]+$)` of regex is a look-ahead. `[^[]` inside this is a negated character class - matching all non-`[` characters. We don't need to escape `[` inside there. Now the whole regex means - *match all `\d+` pattern, which is followed by non-`[` characters till the end*. Note the `$` in the regex is important, else it will match all the digits. – Rohit Jain Dec 17 '15 at 16:11
So, for `C9`, when `\d+` matches the `9` and the regex engine starts matching the lookahead, it fails at the second last `[`, since that doesn't match the negated character class. Similarly, `123` fails at the last `[`. – Rohit Jain Dec 17 '15 at 16:12
Ahhhhh!! I thought it had to be immediately followed by an `[`, that's cool. I still don't understand the `[^[]` part well: why couldn't it be `?=^\]` instead? And why the parentheses? Thank you very much – aDoN Dec 17 '15 at 16:17
@aDoN Parenthesis is required, because that is what makes `?=` a look-ahead. That's regex syntax. Again, `[^[]` is required because `^` negates something only inside a character class. Outside it means start of string. – Rohit Jain Dec 17 '15 at 16:25
Ohhh Yeah you are right, thank you very much the `[^[]` is necessary, so now, the only thing I don't understand is the `+`. If the expression was `re.findall(r'\d+(?=[^[].+$)', s)` with a `.`, it matches every number. So... Why `[^[]+$`. Thank you , your explanations are very clear. – aDoN Dec 17 '15 at 16:38
@aDoN Because a `.` will also match the `[`. Your pattern means any digits that are followed by a single non-`[` character, and then any characters up to the end. Nearly every number in the string satisfies that rule. – Rohit Jain Dec 17 '15 at 17:00
ahhh @RohitJain, yeah, you are right so it's like `^[+` (every thing that is not a `[`). I was thinking, why negative lookahead doesn't work here¿? I mean `re.findall(r'\d+(?!\[+$)', s)`. My intuition is, it searches the first number, the `9`, it looks ahead, it finds the `[` , the `+` after the `[` must work for only non-`[` characters so the `9` and the `123` shouldn't work I think but what happens is that it matches everything. Thank you very much – aDoN Dec 18 '15 at 10:00
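
On the negative look-ahead question in the last comment: `\d+(?!\[+$)` only rejects digits that are immediately followed by nothing but `[` characters up to the end of the string. That never happens in the sample, so every number passes. A minimal sketch of a negative look-ahead that does the job, assuming the goal is "no `[` anywhere later in the string" (this pattern is an illustration, not something proposed in the thread):

```python
import re

s = 'C9: Title of object (fo[ 123o, bar) [ch1, CH12,c03,4]'

# The attempt from the comment: nothing is directly followed by '['s to the end,
# so the negative look-ahead never rejects anything.
attempt = re.findall(r'\d+(?!\[+$)', s)
print(attempt)  # ['9', '123', '1', '12', '03', '4']

# Reject a digit run if a '[' appears anywhere after it.
working = re.findall(r'\d+(?![^\[]*\[)', s)
print(working)  # ['1', '12', '03', '4']
```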

Brian · Answer 2 · 2015-12-17T15:47:47.380

It may help to use the non-greedy (lazy) modifier ?. For example:

\[.*?(\d*?),.*?(\d*?),.*?(\d*?),.*?(\d*?)\]

And, here's how it works (from https://regex101.com/r/jP7hM3/1):

"\[.*?(\d*?),.*?(\d*?),.*?(\d*?),.*?(\d*?)\]"
\[ matches the character [ literally
.*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
1st Capturing group (\d*?)
    \d*? match a digit [0-9]
        Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
, matches the character , literally
.*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
2nd Capturing group (\d*?)
    \d*? match a digit [0-9]
        Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
, matches the character , literally
.*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
3rd Capturing group (\d*?)
    \d*? match a digit [0-9]
        Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
, matches the character , literally
.*? matches any character (except newline)
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
4th Capturing group (\d*?)
    \d*? match a digit [0-9]
        Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
\] matches the character ] literally

Although I have to agree with others: this is a regex solution, but it's not a very Pythonic solution.
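
For reference, a minimal sketch of how the pattern above behaves in Python on the question's sample string. The variable names are illustrative only. It also shows the limitation of hard-coding four groups: a bracket list with a different number of items does not match at all.

```python
import re

# The four-group pattern from this answer, unchanged.
pattern = re.compile(r'\[.*?(\d*?),.*?(\d*?),.*?(\d*?),.*?(\d*?)\]')

s = 'C9: Title of object (foo, bar) [ch1, CH12,c03,4]'
m = pattern.search(s)
groups = m.groups() if m else ()
print(groups)  # ('1', '12', '03', '4')

# Only three items inside the brackets: the pattern needs three commas,
# so there is no match.
print(pattern.search('C9: Title [ch1, CH12,c03]'))  # None
```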

There aren't always going to be exactly 4 numbers inside the brackets. There can be any number of them, or none at all. – Keith Burke Dec 17 '15 at 15:48
Ah, gotcha. You can do groups within groups, if you really do need to do this as a regex. – Brian Dec 17 '15 at 15:50
@Brian Groups within a group will only give you the last matched pattern as output. – Rohit Jain Dec 17 '15 at 16:05
Oh, bummer. Then my answer likely won't help the OP, but I'll leave it in case it's useful. – Brian Dec 17 '15 at 17:50
