4

Assuming I have the following string:

str = """
         HELLO 1 Stop #$**& 5.02‼️ 16.1 
         regex

         5 ,#2.3222
      """

I want to extract all numbers, whether int or float, that appear after the word "stop" (case-insensitive), so the expected result is:

[5.02, 16.1, 5, 2.3222]

The furthest I have got so far is with the PyPI regex module, based on another post here:

regex.compile(r'(?<=stop.*)\d+(?:\.\d+)?', regex.I)

but this expression only gives me [5.02, 16.1].

yair_elmaliah

4 Answers

5

Yet another one, albeit with the newer regex module:

(?:\G(?!\A)|Stop)\D+\K\d+(?:\.\d+)?

See a demo on regex101.com.


In Python, this could be

import regex as re

string = """
         HELLO 1 Stop #$**& 5.02‼️ 16.1 
         regex

         5 ,#2.3222
      """

pattern = re.compile(r'(?:\G(?!\A)|Stop)\D+\K\d+(?:\.\d+)?')

numbers = pattern.findall(string)
print(numbers)

And would yield

['5.02', '16.1', '5', '2.3222']
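
Note that findall returns strings, while the question lists the expected result as numbers. A minimal sketch of the conversion, assuming every token has the shape produced by the pattern above (digits with an optional fractional part); the helper name to_number is made up for illustration:

```python
def to_number(token):
    # Hypothetical helper: tokens with a decimal point become float, the rest int
    return float(token) if "." in token else int(token)

matches = ['5.02', '16.1', '5', '2.3222']
numbers = [to_number(t) for t in matches]
print(numbers)  # [5.02, 16.1, 5, 2.3222]
```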

Don't name your variables after built-ins such as str, list or dict; doing so shadows the built-in for the rest of the scope.


If you need to go further and limit your search to some bounds (e.g. all numbers between Stop and end), you could also use

(?:\G(?!\A)|Stop)(?:(?!end)\D)+\K\d+(?:\.\d+)?
#           ^^^        ^^^

See another demo on regex101.com.
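
If the third-party regex module is not available, roughly the same bounded search can be sketched with the standard re module in two steps. This is not the \G pattern above but a swapped-in technique: first cut out the text between the two keywords, then collect the numbers. The function name numbers_between and the sample string are made up for illustration, and unlike the pattern above this sketch treats both keywords as whole words and ignores case:

```python
import re

def numbers_between(text, start="stop", end="end"):
    # Lazily capture everything after `start` up to the first `end`,
    # or up to the end of the text when `end` never appears
    m = re.search(
        rf"\b{re.escape(start)}\b(.*?)(?:\b{re.escape(end)}\b|\Z)",
        text,
        re.I | re.S,
    )
    # No `start` keyword at all: nothing to extract
    return re.findall(r"\d+(?:\.\d+)?", m.group(1)) if m else []

sample = "1 Stop 2.5 abc 3\n4 end 5 6.7"
print(numbers_between(sample))  # ['2.5', '3', '4']
```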

Jan
  • Nice one, Jan. `^.*\bStop\b(\D*\K\d+(?:\.\d+)?)|\G(?1)` seems to also work with PCRE2 ([Demo](https://regex101.com/r/sALreI/1)), where `(?1)` is a subroutine that reuses the code in capture group 1), but I can't get it to work [here](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEoNT21QiGxGMjg4uIqLilSsFVQUlLiUoABD1cfH38FQ4XgkvwCBWUVLS01BVM9A6NHDXve7@hXMDTTM1RAKAabxoXgmyroKBvpGRsZGUHFQEZzFaVXAG0pUo/T04pJApkbk6QR46IV4x2Toq1hbxWjB6Q17TVrYtw17A011bm4Cooy80o0ilL10jLzUhJzcjSAJugoAB2rqcn1/z8A), but... – Cary Swoveland Nov 06 '21 at 04:36
  • ...your regex works fine [there](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEoNT21QiGxGMjg4uIqLilSsFVQUlLiUoABD1cfH38FQ4XgkvwCBWUVLS01BVM9A6NHDXve7@hXMDTTM1RAKAabxoXgmyroKBvpGRsZGUHFQEZzFaVXAG0pUtewt4px17BXjHHUrAEZrxnjoh3jHZOiDZLQA9Ka9upcXAVFmXklGkWpemmZeSmJOTkaQO06CkCXampy/f8PAA). Do you see a problem with my variant? – Cary Swoveland Nov 06 '21 at 04:37
  • @CarySwoveland: It does not really work, see [**here**](https://tio.run/##RY4xqsJAEIb7OcUQRXejLGZFi4CkURSeYGHpKihG3SKbZbNFAq94N/A6r/A0XsAjxDEoTjPz//x8/9jKX3IzrGud2dx5dOk5LXFf0AEAhXc4wSAIAD@zmC2XK4xw7XOLrXYYdnAkBvL@d3v8XzEaiwi/4YYGXz3CfkuKoZTy7b3Q4M4ltbjuToTq8OKqA1PTUP2oY48lsRK0ecJ/1ZwlEe8CnHKHGWpDfHHS5qh96hhR@kgP87hhk6iImonC7g3jjWedNp5RZlPG1ZZDXT8B) (you have too many matches, see the `"1"`). You can get it to work with the matched objects. Capture groups matched inside of recursion are not accessible, see the [documentation](https://perldoc.perl.org/perlre). Did that help? – Jan Nov 06 '21 at 08:33
  • Yes, helpful, but I still don't understand why it works with [PCRE2](https://regex101.com/r/WKJEkc/1/) but not at Tio. – Cary Swoveland Nov 06 '21 at 19:59
  • @CarySwoveland Here is [the correct demo of your solution](https://tio.run/##LY1NCsIwFIT37xSPVmxSJdhIXQjSjaKg4MKlUVD6YxY2IY1QwYU38DouPI0X8Ag1FmczM4v5Rl/tSZXDppFnrYxFkxVZjYfKBQCocIKe5y1mq9UaI9xYpdHvhGEXYzbg7/vr83xgNGIRArZq5/AvMfZ9zoacc8cAMEXtcCbYs1AcfyhxJGIaiqVIeyQZC@acJvQm5iSJaACgjSwt2dasMOqiCcVcGaxRlu6G5bJMpc0Mcdg@VnRHoWm@). You might want to use `rgx = r'(?ms)^.*?\bStop\b(\D*\K\d+(?:\.\d+)?)|\G(?1)'` here, probably, with `[x.group() for x in re.finditer(rgx, s)]`. – Wiktor Stribiżew Nov 15 '21 at 08:26
  • @CarySwoveland It seems all captures are lost when you recurse the pattern with a subroutine and use `regex.findall`. With `regex.finditer` and your regex in the Python code at tio, you get the whole matches, but even numbers before `Stop` are extracted, because of `\G(?1)`, as `\G` also matches the start of a string, and your string in Python code starts with a line break, and the `.*` does not match across lines as you did not use the `re.DOTALL` flag. See [this regex demo](https://regex101.com/r/sALreI/2) showing how it is matched. – Wiktor Stribiżew Nov 15 '21 at 08:32
3

You get only the first 2 numbers because, by default, .* does not match a newline.

You can update the flags to regex.I | regex.S so that the dot also matches a newline.

import regex

text = """
         HELLO 1 Stop #$**& 5.02‼️ 16.1 
         regex

         5 ,#2.3222
      """

pattern = regex.compile(r'(?<=\bstop\b.*)\d+(?:\.\d+)?', regex.I | regex.S)

print(regex.findall(pattern, text))

Output

['5.02', '16.1', '5', '2.3222']

See a Python demo
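
To see the effect of the dot-all flag in isolation, here is a small sketch using the standard re module, where re.S behaves the same way; the two-line sample string is made up for illustration:

```python
import re

sample = "Stop 1\n2"

# Without re.S, `.` stops at the newline, so only the rest of the first line is captured
first_line = re.search(r"stop(.*)", sample, re.I).group(1)
# With re.S, `.` also matches '\n', so everything after "Stop" is captured
whole_tail = re.search(r"stop(.*)", sample, re.I | re.S).group(1)

print(repr(first_line))  # ' 1'
print(repr(whole_tail))  # ' 1\n2'
```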


If you want to print the numbers after the word "stop", you can also use Python's built-in re module: match stop, then capture everything that follows in a group.

Then you can take the group 1 value and find all the numbers in it.

import re
 
text = """
         HELLO 1 Stop #$**& 5.02‼️ 16.1 
         regex
 
         5 ,#2.3222
      """
pattern = r"\bStop\b(.+)"
 
m = re.search(pattern, text, re.S|re.I)
 
if m:
    # `?` (not `*`) keeps at most one fractional part, so "1.2.3" is not read as one number
    print(re.findall(r"\d+(?:\.\d+)?", m.group(1)))

Output

['5.02', '16.1', '5', '2.3222']
The fourth bird
0

You could use:

inp = """
HELLO 1 Stop #$**& 5.02‼️ 16.1 
regex

5 ,#2.3222"""

nums = []
if re.search(r'\bstop\b', inp, flags=re.I):
    inp = re.sub(r'^.*?\bstop\b', '', inp, flags=re.S|re.I)
    nums = re.findall(r'\d+(?:\.\d+)?', inp)

print(nums)  # ['5.02', '16.1', '5', '2.3222']

The if logic above ensures that we only attempt to populate the array of numbers if we are certain that Stop appears in the input text. Otherwise, the default output is just an empty array. If Stop does appear, then we strip off that leading portion of the string before using re.findall to find all numbers appearing afterwards.
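
The same logic can be wrapped in a function so both branches are easy to check. This is only a sketch of the steps described above; the function name numbers_after_stop and the sample inputs are made up for illustration:

```python
import re

def numbers_after_stop(text):
    # Return an empty list early when "stop" never appears as a whole word
    if not re.search(r"\bstop\b", text, flags=re.I):
        return []
    # Drop everything up to and including the first "stop", across newlines
    tail = re.sub(r"^.*?\bstop\b", "", text, count=1, flags=re.S | re.I)
    return re.findall(r"\d+(?:\.\d+)?", tail)

print(numbers_after_stop("1 stop 2\n3.5"))  # ['2', '3.5']
print(numbers_after_stop("1 2 3"))          # []
```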

Tim Biegeleisen
0
import re

_string = """
          HELLO 1 Stop #$**& 5.02‼️ 16.1
          regex

          5 ,#2.3222
       """

start = _string.find("Stop") + len("Stop")
# Raw string, so that \d and \. are not treated as (invalid) string escapes
print(re.findall(r"[-+]?\d*\.?\d+", _string[start:]))   # ['5.02', '16.1', '5', '2.3222']
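
Note that str.find is case-sensitive and returns -1 when the word is missing, in which case the slice above would silently start near the beginning of the string. A possible variant, sketched here with re.search to locate the keyword regardless of case; the function name numbers_after and the sample inputs are made up for illustration:

```python
import re

def numbers_after(text, word="stop"):
    # Locate the keyword case-insensitively; re.escape guards against regex metacharacters
    m = re.search(re.escape(word), text, re.I)
    if m is None:
        # Keyword absent: return nothing rather than scanning the whole text
        return []
    # Same number pattern as above, applied only to the text after the keyword
    return re.findall(r"[-+]?\d*\.?\d+", text[m.end():])

print(numbers_after("1 STOP 2 -3.5 .5"))  # ['2', '-3.5', '.5']
print(numbers_after("1 2"))               # []
```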

maya