4

Does the re module of Python 3 offer a built-in way to get both the match and the rest (the non-matching part) back?

Here is a simple example:

>>> import re
>>> p = r'\d'
>>> s = '1a'
>>> re.findall(p, s)
['1']

The result I want is something like ['1', 'a'] or [['1'], ['a']] or something else where I can differentiate between match and rest.

Of course I can subtract the resulting (matching) string from the original one to get the rest. But is there a built-in way to do this?

I do not set the regex tag here because the question is less related to RegEx itself but more to a feature of a Python package.

buhtz
  • 2
    Is the part you're trying to match always in the front (in the example yes, but you've asked a general question), and do you mean by _rest_ only the part _after_ the match (and not a potential part _before_ it)? – Timus Mar 23 '22 at 15:34

3 Answers

4

You can match everything and use groups to separate the important part from the rest:

>>> import re
>>> p = r'(\d+)(.*)'
>>> s = '12a\n34b\ncde'
>>> re.findall(p, s)
[('12', 'a'), ('34', 'b')]

re.findall documentation
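A related option, not used in the answer above, is re.split with a capturing group: when the pattern contains a capturing group, re.split keeps the captured text in the result list, so matches and non-matching parts come back interleaved. The following is only a sketch and assumes the whole pattern is wrapped in exactly one group, so that odd indexes hold matches and even indexes hold the rest. Unlike the findall approach, the line without digits is not lost.

```python
import re

s = '12a\n34b\ncde'

# One capturing group around the whole pattern: re.split keeps the matches.
parts = re.split(r'(\d+)', s)
print(parts)  # ['', '12', 'a\n', '34', 'b\ncde']

# Odd indexes are the matches, even indexes are the text in between.
matches = parts[1::2]
rest = parts[0::2]
print(matches)  # ['12', '34']
print(rest)     # ['', 'a\n', 'b\ncde']

# The example from the question:
print(re.split(r'(\d)', '1a'))  # ['', '1', 'a']
```

The empty string at the start appears because the first match sits at position 0, so nothing precedes it.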

Nilton Moura
  • 2
    Isn't that more or less identical with [this](https://stackoverflow.com/a/71589548/14311263) answer? – Timus Mar 23 '22 at 15:41
  • 1
    Thank you. It's similar, but it wasn't here when I checked the question and started to write an answer. I can see that both answers were posted within a short time of each other. I'll leave my answer here since it's more concise and has an example with more 'lines' of data. I also wanted to show that the regex from the author's question will miss lines that don't have digits. – Nilton Moura Mar 23 '22 at 16:30
3

A possible solution is the following:

import re

string = '1a'
re_pattern = r'^(\d+)(.*)'

result = re.findall(re_pattern, string)
print(result)

This returns a list of tuples:

[('1', 'a')]

Or, if you would rather get a flat list of str items:

result = [item for t in re.findall(re_pattern, string) for item in t]
print(result)

This returns:

['1', 'a']

Explanations to the code:

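  • re_pattern = r'^(\d+)(.*)' anchors the match at the start of the string with ^ and defines two groups: the 1st group (\d+) matches one or more digits, the 2nd group (.*) matches the rest of the string (up to the next newline, since . does not match newlines by default).
  • re.findall(re_pattern, string) returns a list of tuples, one tuple per match with one item per group, like [('1', 'a')]
  • the list comprehension flattens the list of tuples into a list of string items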
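As a small sketch of the explanation above: because the pattern is anchored with ^ and no re.MULTILINE flag is set, it can match at most once, so re.match with .groups() gives the two parts directly. The use of itertools.chain for flattening is just an alternative to the list comprehension, not something the answer itself relies on.

```python
import re
from itertools import chain

string = '1a'
re_pattern = r'^(\d+)(.*)'

# At most one match is possible here, so re.match is enough.
m = re.match(re_pattern, string)
parts = list(m.groups()) if m else []
print(parts)  # ['1', 'a']

# Flatten the findall result (a list of tuples) into a flat list of strings.
flat = list(chain.from_iterable(re.findall(re_pattern, string)))
print(flat)  # ['1', 'a']
```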
gremur
  • This works, but I do not understand why. You modified the pattern for this, correct? Can you explain what happens here and possibly link to the corresponding regex documentation? **EDIT**: Fits my needs perfectly! – buhtz Mar 23 '22 at 15:23
2

No, the match does not show the data that was cut off by itself.

The Match object that a regex search gives you contains information about where the match was found; you can use that to extract the rest:

import re
p = r'\d'
s = '1a'
match = next(re.finditer(p, s))
# >>> match
# <re.Match object; span=(0, 1), match='1'>

head = match.string[:match.start()]  # ""
tail = match.string[match.end():]  # "a"

Note that re.findall doesn't give you Match objects; you need a function that does, such as re.finditer. I'm using next() here because re.finditer returns an iterator rather than a list; usually you would convert it to a list or loop over it.
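To cover every match in the string rather than only the first one, the same start()/end() idea can be wrapped in a loop. The helper name split_matches below is hypothetical (it is not part of the re module), and the sketch assumes the pattern does not produce empty matches.

```python
import re

def split_matches(pattern, text):
    """Yield (is_match, segment) pairs that cover the whole text in order."""
    pos = 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            # Text between the previous match (or the start) and this match.
            yield (False, text[pos:m.start()])
        yield (True, m.group())
        pos = m.end()
    if pos < len(text):
        # Trailing text after the last match.
        yield (False, text[pos:])

print(list(split_matches(r'\d', '1a')))
# [(True, '1'), (False, 'a')]
print(list(split_matches(r'\d', 'x1a2b')))
# [(False, 'x'), (True, '1'), (False, 'a'), (True, '2'), (False, 'b')]
```

The boolean flag is what lets the caller tell a match from the rest.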


Another option would be to make these groups in your pattern directly.

If you're interested in both, before and after the match:

import re
p = r'(^.*?)(\d)(.*$)'
s = '1a'
re.findall(p, s)
# [('', '1', 'a')]

But this will not give you multiple results in the same string, because the matches would overlap and the built-in re library does not support variable-width lookbehinds.
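A quick check of that limitation, using the same pattern on a string with two digits: the first match consumes the whole string, so the second digit only shows up inside the trailing group.

```python
import re

p = r'(^.*?)(\d)(.*$)'
s = '1a2b'

# The single match spans the entire string, so '2' is never reported on its own.
result = re.findall(p, s)
print(result)  # [('', '1', 'a2b')]
```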

If you're only interested in the string after the match, you can capture it inside a lookahead, which does not consume it:

import re
p = r'(\d)(?=(.*))'
s = '1a'
re.findall(p, s)
# [('1', 'a')]
s = '1a2b'
re.findall(p, s)
# [('1', 'a2b'), ('2', 'b')]
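One detail worth checking for multi-line input: by default . does not match a newline, so the lookahead group stops at the end of the current line. With the re.DOTALL flag it runs to the end of the string instead. Which behaviour is wanted depends on the use case; this sketch only shows the difference.

```python
import re

p = r'(\d)(?=(.*))'
s = '1a\n2b'

# Default: the "rest" ends at the newline.
per_line = re.findall(p, s)
print(per_line)  # [('1', 'a'), ('2', 'b')]

# re.DOTALL: the "rest" runs to the end of the whole string.
whole = re.findall(p, s, flags=re.DOTALL)
print(whole)  # [('1', 'a\n2b'), ('2', 'b')]
```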
Talon
  • 1
    Please fit your answer code to my MWE. That is why I created an MWE. And this does not answer my question, because the split is done by the user and not by the `re` package. – buhtz Mar 23 '22 at 15:19
  • 1
    I still don't fully understand the question's intention, so I might be wrong, but I think this is the best answer: (1) `match.start()`/`match.end()` is the best `re` mechanism to capture _all_ the non-match parts of the string. (2) The use of the non-consuming lookahead is superior to the simpler solutions (depending on the use case a `re.DOTALL` flag might be helpful ... or not). +1 – Timus Mar 23 '22 at 18:47