Extract all numbers (ints and floats) after a specific word

Question

Assuming I have the following string:

str = """
         HELLO 1 Stop #$**& 5.02‼️ 16.1 
         regex

         5 ,#2.3222
      """

I want to extract all numbers, whether int or float, that appear after the word "stop" (case-insensitive). The expected result is:

[5.02, 16.1, 5, 2.3222]

The farthest I have come so far is this pattern for the PyPI regex module, taken from another post here:

regex.compile(r'(?<=stop.*)\d+(?:\.\d+)?', regex.I)

but this expression gives me only [5.02, 16.1]

What have you tried so far? Do you have a _specific_ question? — John Gordon, Nov 05 '21 at 02:08

Jan · Answer 1 · 2021-11-05T09:36:10.583

5

Yet another one, albeit with the newer regex module:

(?:\G(?!\A)|Stop)\D+\K\d+(?:\.\d+)?

See a demo on regex101.com.
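For readers without the regex module, the way this pattern chains its matches can be roughly sketched with the standard-library re: find Stop once, then repeatedly match "non-digits followed by a number" starting exactly where the previous match ended, which is the role \G plays above. The helper name numbers_after and the shortened sample string are made up for illustration, and unlike the pattern above this sketch only follows the first Stop.

```python
import re

# Hypothetical helper, not part of any library.
def numbers_after(word, text):
    anchor = re.search(re.escape(word), text)
    if anchor is None:
        return []
    # One "link" of the chain: skip non-digits, then capture a number.
    link = re.compile(r"\D+(\d+(?:\.\d+)?)")
    pos, found = anchor.end(), []
    # match() with a start position only succeeds if the link begins exactly
    # at pos, which mirrors what \G does in the original pattern.
    while (m := link.match(text, pos)) is not None:
        found.append(m.group(1))
        pos = m.end()
    return found

sample = "HELLO 1 Stop #$**& 5.02 16.1\nregex\n\n5 ,#2.3222\n"
print(numbers_after("Stop", sample))  # ['5.02', '16.1', '5', '2.3222']
```

The chain ends as soon as the remaining text contains no further digits, so the 1 before Stop is never reached.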

In Python, this could be

import regex as re

string = """
         HELLO 1 Stop #$**& 5.02‼️ 16.1 
         regex

         5 ,#2.3222
      """

pattern = re.compile(r'(?:\G(?!\A)|Stop)\D+\K\d+(?:\.\d+)?')

numbers = pattern.findall(string)
print(numbers)

And would yield

['5.02', '16.1', '5', '2.3222']
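Note that findall returns strings, while the question asks for [5.02, 16.1, 5, 2.3222]. A minimal conversion sketch, assuming that a token containing a dot should become a float and anything else an int (to_number is a made-up name):

```python
def to_number(token):
    # Tokens containing a dot become floats, everything else ints.
    return float(token) if "." in token else int(token)

matches = ['5.02', '16.1', '5', '2.3222']  # output copied from above
print([to_number(t) for t in matches])     # [5.02, 16.1, 5, 2.3222]
```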

Don't name your variables after built-ins such as str, list, or dict.
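As a small illustration of why: once a name like str is rebound, the built-in is hidden in that scope. The function below is only a demonstration (its name is invented) and keeps the shadowing local so nothing else is affected.

```python
def shadowing_demo():
    str = "HELLO 1 Stop"   # local name now hides the built-in str type
    try:
        return str(5)      # tries to call a string object
    except TypeError as exc:
        return exc.args[0]

print(shadowing_demo())    # 'str' object is not callable
```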

If you need to go further and limit your search to some bounds (e.g. all numbers between Stop and end), you could also use

(?:\G(?!\A)|Stop)(?:(?!end)\D)+\K\d+(?:\.\d+)?
#           ^^^        ^^^

See another demo on regex101.com.
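If only the standard-library re is available, a similar bounded search can be approximated in two steps: capture the text between the two words, then collect the numbers from it. This is a sketch rather than an equivalent of the pattern above; the function name, the default delimiters, and the choice to fall back to the end of the text when end is missing are all assumptions.

```python
import re

def numbers_between(text, start="Stop", stop="end"):
    # Lazily capture everything from `start` up to the first `stop`,
    # or to the end of the text when `stop` never appears.
    region = re.search(
        re.escape(start) + r"(.*?)(?:" + re.escape(stop) + r"|\Z)",
        text,
        flags=re.S,
    )
    if region is None:
        return []
    return re.findall(r"\d+(?:\.\d+)?", region.group(1))

print(numbers_between("1 Stop 2.5 x 3 end 4"))  # ['2.5', '3']
```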

edited Nov 05 '21 at 09:36

answered Nov 05 '21 at 08:40

Jan


Nice one, Jan. `^.*\bStop\b(\D*\K\d+(?:\.\d+)?)|\G(?1)` seems to also work with PCRE2 ([Demo](https://regex101.com/r/sALreI/1)), where `(?1)` is a subroutine that reuses the code in capture group 1), but I can't get it to work [here](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEoNT21QiGxGMjg4uIqLilSsFVQUlLiUoABD1cfH38FQ4XgkvwCBWUVLS01BVM9A6NHDXve7@hXMDTTM1RAKAabxoXgmyroKBvpGRsZGUHFQEZzFaVXAG0pUo/T04pJApkbk6QR46IV4x2Toq1hbxWjB6Q17TVrYtw17A011bm4Cooy80o0ilL10jLzUhJzcjSAJugoAB2rqcn1/z8A), but... – Cary Swoveland Nov 06 '21 at 04:36
...your regex works fine [there](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEoNT21QiGxGMjg4uIqLilSsFVQUlLiUoABD1cfH38FQ4XgkvwCBWUVLS01BVM9A6NHDXve7@hXMDTTM1RAKAabxoXgmyroKBvpGRsZGUHFQEZzFaVXAG0pUtewt4px17BXjHHUrAEZrxnjoh3jHZOiDZLQA9Ka9upcXAVFmXklGkWpemmZeSmJOTkaQO06CkCXampy/f8PAA). Do you see a problem with my variant? – Cary Swoveland Nov 06 '21 at 04:37
@CarySwoveland: It does not really work, see [**here**](https://tio.run/##RY4xqsJAEIb7OcUQRXejLGZFi4CkURSeYGHpKihG3SKbZbNFAq94N/A6r/A0XsAjxDEoTjPz//x8/9jKX3IzrGud2dx5dOk5LXFf0AEAhXc4wSAIAD@zmC2XK4xw7XOLrXYYdnAkBvL@d3v8XzEaiwi/4YYGXz3CfkuKoZTy7b3Q4M4ltbjuToTq8OKqA1PTUP2oY48lsRK0ecJ/1ZwlEe8CnHKHGWpDfHHS5qh96hhR@kgP87hhk6iImonC7g3jjWedNp5RZlPG1ZZDXT8B) (you have too many matches, see the `"1"`). You can get it to work with the match objects. Capture groups matched inside of recursion are not accessible, see the [documentation](https://perldoc.perl.org/perlre). Did that help? – Jan Nov 06 '21 at 08:33
Yes, helpful, but I still don't understand why it works with [PCRE2](https://regex101.com/r/WKJEkc/1/) but not at Tio. – Cary Swoveland Nov 06 '21 at 19:59
@CarySwoveland Here is [the correct demo of your solution](https://tio.run/##LY1NCsIwFIT37xSPVmxSJdhIXQjSjaKg4MKlUVD6YxY2IY1QwYU38DouPI0X8Ag1FmczM4v5Rl/tSZXDppFnrYxFkxVZjYfKBQCocIKe5y1mq9UaI9xYpdHvhGEXYzbg7/vr83xgNGIRArZq5/AvMfZ9zoacc8cAMEXtcCbYs1AcfyhxJGIaiqVIeyQZC@acJvQm5iSJaACgjSwt2dasMOqiCcVcGaxRlu6G5bJMpc0Mcdg@VnRHoWm@). You might want to use `rgx = r'(?ms)^.*?\bStop\b(\D*\K\d+(?:\.\d+)?)|\G(?1)'` here, probably, with `[x.group() for x in re.finditer(rgx, s)]`. – Wiktor Stribiżew Nov 15 '21 at 08:26
@CarySwoveland It seems all captures are lost when you recurse the pattern with a subroutine and use `regex.findall`. With `regex.finditer` and your regex in the Python code at tio, you get the whole matches, but even numbers before `Stop` are extracted, because of `\G(?1)`, as `\G` also matches the start of a string, and your string in Python code starts with a line break, and the `.*` does not match across lines as you did not use the `re.DOTALL` flag. See [this regex demo](https://regex101.com/r/sALreI/2) showing how it is matched. – Wiktor Stribiżew Nov 15 '21 at 08:32

The fourth bird · Answer 2 · 2021-11-05T08:57:40.950

3

You get only the first 2 numbers, as .* does not match a newline.

You can update the flags to regex.I | regex.S to have the dot match a newline.

import regex

text = """
         HELLO 1 Stop #$**& 5.02‼️ 16.1 
         regex

         5 ,#2.3222
      """

pattern = regex.compile(r'(?<=\bstop\b.*)\d+(?:\.\d+)?', regex.I | regex.S)

print(regex.findall(pattern, text))

Output

['5.02', '16.1', '5', '2.3222']

See a Python demo
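The dot/newline behaviour is the same in the built-in re module, so it can be checked in isolation without installing regex. A minimal sketch on a shortened, made-up input:

```python
import re

text = "Stop 5.02\n5"

# Without re.S the dot stops at the first newline.
print(re.findall(r"Stop.*", text))        # ['Stop 5.02']

# With re.S (DOTALL) the dot also matches "\n", so the match runs on.
print(re.findall(r"Stop.*", text, re.S))  # ['Stop 5.02\n5']
```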

If you want the numbers after the word "stop", you can also use Python's built-in re module: match stop, then capture everything that follows in a group.

Then take the group 1 value and find all the numbers in it.

import re
 
text = """
         HELLO 1 Stop #$**& 5.02‼️ 16.1 
         regex
 
         5 ,#2.3222
      """
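pattern = r"\bStop\b(.+)"

# re.S lets the dot cross newlines; re.I makes "Stop" case-insensitive.
m = re.search(pattern, text, re.S | re.I)

if m:
    # Group 1 holds everything after "Stop"; collect the numbers from it.
    print(re.findall(r"\d+(?:\.\d+)?", m.group(1)))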

Output

['5.02', '16.1', '5', '2.3222']

edited Nov 05 '21 at 08:57

answered Nov 05 '21 at 08:13

The fourth bird


Ha! My now-deleted answer may look familiar. – Cary Swoveland Nov 05 '21 at 08:26
@CarySwoveland You can also use the inline modifiers :-) – The fourth bird Nov 05 '21 at 08:28
`\G` might be better suited (+1), variable lookbehinds are expensive. – Jan Nov 05 '21 at 08:41

score 0 · Answer 3 · answered Nov 05 '21 at 02:12

You could use:

inp = """
HELLO 1 Stop #$**& 5.02‼️ 16.1 
regex

5 ,#2.3222"""

nums = []
if re.search(r'\bstop\b', inp, flags=re.I):
    inp = re.sub(r'^.*?\bstop\b', '', inp, flags=re.S|re.I)
    nums = re.findall(r'\d+(?:\.\d+)?', inp)

print(nums)  # ['5.02', '16.1', '5', '2.3222']

The if check ensures that we only try to collect numbers when stop (in any letter case) actually appears in the input text; otherwise the output stays an empty list. If it does appear, we strip everything up to and including its first occurrence, then use re.findall to collect all numbers that follow.
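The same logic could be wrapped in a function so that both branches are easy to try. This is only a sketch: numbers_after_stop and the short test strings are invented for illustration.

```python
import re

def numbers_after_stop(text):
    # Word absent (in any letter case): default to an empty list.
    if not re.search(r"\bstop\b", text, flags=re.I):
        return []
    # Drop everything up to and including the first "stop".
    tail = re.sub(r"^.*?\bstop\b", "", text, flags=re.S | re.I)
    return re.findall(r"\d+(?:\.\d+)?", tail)

print(numbers_after_stop("1 stop 2 3.5"))  # ['2', '3.5']
print(numbers_after_stop("1 2 3"))         # []
```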

score 0 · Answer 4 · answered Nov 05 '21 at 03:20

import re

_string = """
          HELLO 1 Stop #$**& 5.02‼️ 16.1
          regex

          5 ,#2.3222
       """

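# str.find is case-sensitive, so search a lowercased copy; it returns -1
# when the word is absent, in which case there is nothing to extract.
start = _string.lower().find("stop")
tail = _string[start + len("stop"):] if start != -1 else ""
print(re.findall(r"[-+]?\d*\.?\d+", tail))   # ['5.02', '16.1', '5', '2.3222']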
