I'm studying Python 3 but I'm struggling to get regex with the re module.

Here's my problem: I have the string

phrase = "s000000000 s1133122 s21 s3 s4 s5212638476234857634 s6 s7 s8 s9000"

and, using the function

re.findall(pattern, phrase)

I'd like to extract:

  1. s0-s9 strings without the additional characters;
  2. s0-s3 strings without the additional characters;
  3. s0-s3 strings with the additional characters;
  4. s4-s9 strings with the additional characters.

I managed to accomplish the first three tasks using the following patterns:

  1. pattern = "s[0-9]"
  2. pattern = "s[0-3]"
  3. pattern = "s[0-3]+"

For the last task, though, I tried to replicate what I did in the third one and used

pattern = "s[4-9]+"

but, instead of getting as result

["s4", "s5212638476234857634", "s6", "s7", "s8", "s9000"]

I get

["s4", "s5", "s6", "s7", "s8", "s9"]

Why is that? What am I missing? The book I'm studying from states that the plus sign means "one or more characters", and the s[0-3]+ pattern does in fact work, but I cannot make it work for this specific problem.

jnsen76
  • If you use `'s[4-9]+'`, you match a string that starts with an `s` followed only by digits from 4 to 9. `"s5212638476234857634"` contains digits lower than 4, and `"s9000"` does not satisfy this rule either. – mosc9575 Jan 27 '21 at 17:24
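
To make the comment above concrete, here is a minimal sketch that checks, token by token, how much of each token `s[4-9]+` actually consumes. The `consumed` dictionary and the split on whitespace are only illustrative choices, not part of the original question.

```python
import re

phrase = "s000000000 s1133122 s21 s3 s4 s5212638476234857634 s6 s7 s8 s9000"

# For each whitespace-separated token, record the part that s[4-9]+ matches
# at the start of the token. [4-9]+ stops at the first digit outside 4-9.
consumed = {}
for token in phrase.split():
    m = re.match(r"s[4-9]+", token)
    if m:
        consumed[token] = m.group()

for token, part in consumed.items():
    print(f"{token!r}: matched {part!r}")
# 's4': matched 's4'
# 's5212638476234857634': matched 's5'
# 's6': matched 's6'
# 's7': matched 's7'
# 's8': matched 's8'
# 's9000': matched 's9'
```

The match for `s5212638476234857634` ends right after the `5` because the next digit is `2`, and the match for `s9000` ends after the `9` because the next digit is `0`.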

1 Answer

You need to use

s[4-9]\d*

See the regex demo. Note: you might want to start matching at a word boundary if s should not be preceded by any word characters: \bs[4-9]\d*. In Python, it would look like r'\bs[4-9]\d*'.
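
As a rough illustration of the word-boundary note, the sketch below compares the two patterns on a made-up string. The `sample` text is an assumption chosen for the example and does not come from the question.

```python
import re

# Hypothetical input: "xs5" and "ss7" contain an s that is preceded by a word character.
sample = "s4 xs5 s61 ss7"

without_boundary = re.findall(r"s[4-9]\d*", sample)
with_boundary = re.findall(r"\bs[4-9]\d*", sample)

print(without_boundary)  # ['s4', 's5', 's61', 's7']
print(with_boundary)     # ['s4', 's61']
```

With `\b`, the `s` must start a word, so the `s5` inside `xs5` and the `s7` inside `ss7` are no longer picked up.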

Details:

  • s - an s char
  • [4-9] - a digit from 4 to 9
  • \d* - zero or more digits.

See the Python demo:

import re
rx = r"s[4-9]\d*"
text = "s000000000 s1133122 s21 s3 s4 s5212638476234857634 s6 s7 s8 s9000"
print(re.findall(rx, text))
# => ['s4', 's5212638476234857634', 's6', 's7', 's8', 's9000']
Wiktor Stribiżew
  • Thank you, that's a great feature; my book doesn't mention `\d`. The thing is: why does s[0-3]+ work while s[4-9]+ doesn't? I cannot get it – jnsen76 Jan 27 '21 at 16:51
  • @jnsen76 `[0-3]+` matches one or more `0`, `1`, `2` or `3` chars, it does not match `4`, `5`, etc. – Wiktor Stribiżew Jan 27 '21 at 16:52
  • OK, never mind, I played with the editor and I got why it doesn't work with [3-9]+. Thank you. The regex tester is a great thing BTW, thank you for introducing it to me – jnsen76 Jan 27 '21 at 16:58
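
For readers with the same question, a small sketch of why `s[0-3]+` only appeared to work: in the original string, every token that starts with s0-s3 happens to contain nothing but the digits 0-3. The second `sample` string is a made-up counterexample and is not from the question.

```python
import re

phrase = "s000000000 s1133122 s21 s3 s4 s5212638476234857634 s6 s7 s8 s9000"

# Tokens whose first digit is 0-3, and whether all their digits are in 0-3.
low_tokens = [t for t in phrase.split() if t[1] in "0123"]
only_low_digits = all(set(t[1:]) <= set("0123") for t in low_tokens)
print(low_tokens)       # ['s000000000', 's1133122', 's21', 's3']
print(only_low_digits)  # True

# Hypothetical counterexample: tokens that start with 0-3 but contain higher digits.
sample = "s3456 s0129"
plus_result = re.findall(r"s[0-3]+", sample)
star_result = re.findall(r"s[0-3]\d*", sample)
print(plus_result)  # ['s3', 's012']
print(star_result)  # ['s3456', 's0129']
```

So `s[0-3]+` truncates in the same way as `s[4-9]+` once a digit outside the class appears; restricting only the first digit and following it with `\d*` avoids that in both cases.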