1

I have two strings where I want to isolate sequences of digits from everything else.

For example:

import re
s = 'abc123abc'
print(re.split(r'(\d+)', s))
s = 'abc123abc123'
print(re.split(r'(\d+)', s))

The output looks like this:

['abc', '123', 'abc']
['abc', '123', 'abc', '123', '']

Note that in the second case, there's a trailing empty string.

Obviously, I can test for that and remove it if necessary, but that seems cumbersome. Can the regular expression be improved to account for this scenario?

  • Do you actually want `['123', '123']` as a result? – mkrieger1 Oct 20 '21 at 12:45
  • In this case, you can use `s.strip("1234567890")`, but what are you going to do for a more complicated regular expression? – chepner Oct 20 '21 at 12:53
  • In any case, IMO this is a duplicate of [Python - re.split: extra empty strings that the beginning and end list](https://stackoverflow.com/questions/30924509/python-re-split-extra-empty-strings-that-the-beginning-and-end-list) Unless OP actually wants `['123', '123']` as a result, in which case the solution would be to use `findall`, not `split`. – mkrieger1 Oct 20 '21 at 12:54
  • @mkrieger1 I should have been clearer in my question. Sorry. I want the output as shown but without the trailing empty string. Use of *filter* does what I need –  Oct 20 '21 at 13:04

3 Answers

0

You can use filter to drop the empty string:

>>> s = 'abc123abc123'
>>> re.split(r'(\d+)', s)
['abc', '123', 'abc', '123', '']

>>> list(filter(None, re.split(r'(\d+)', s)))
['abc', '123', 'abc', '123']

Thanks to @chepner, you can also use a list comprehension:

>>> [x for x in re.split(r'(\d+)', s) if x]
['abc', '123', 'abc', '123']

This also works if the string contains symbols or other non-digit characters that need to be split out:

>>> s = '&^%123abc123$#@123'
>>> list(filter(None, re.split(r'(\d+)', s)))
['&^%', '123', 'abc', '123', '$#@', '123']
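
The same filtering also covers a leading empty string, which `re.split` produces when the input starts with digits. A minimal sketch, where `split_digits` is a hypothetical helper name used only for illustration:

```python
import re

def split_digits(s):
    """Split s into digit and non-digit runs, dropping empty pieces."""
    # The capture group keeps the digit runs in the result;
    # filter(None, ...) then drops any '' left at either end.
    return list(filter(None, re.split(r'(\d+)', s)))

print(split_digits('123abc'))        # ['123', 'abc']
print(split_digits('abc123abc123'))  # ['abc', '123', 'abc', '123']
```

Note that `filter(None, ...)` removes every falsy item, which is fine here because the only falsy string is the empty one.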
I'mahdi
  • You can also use a list comprehension or generator expression, with `x for x in re.split(...) if x`. – chepner Oct 20 '21 at 12:50
  • Just not elegant to overgenerate and then filter. Especially when the OP states that they can easily do that but don't want to. – user2390182 Oct 20 '21 at 13:03
0

A simple way to use regular expressions for this would be re.findall:

import re

def bits(s):
    return re.findall(r"(\D+|\d+)", s)

bits("abc123abc123")
# ['abc', '123', 'abc', '123']

But it seems easier and more natural with itertools.groupby. After all, you are chunking an iterable based on a single condition:

from itertools import groupby

def bits(s):
    return ["".join(g) for _, g in groupby(s, key=str.isdigit)]

bits("abc123abc123")
# ['abc', '123', 'abc', '123']
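
As a quick sanity check, the two versions can be compared side by side. This is only a sketch: `bits_findall` and `bits_groupby` are hypothetical names for the two functions above, renamed so that both can live in one script. They agree on ASCII input, but `str.isdigit` and `\d` do not classify every Unicode character the same way, so results can differ for characters such as a superscript two.

```python
import re
from itertools import groupby

def bits_findall(s):
    # Alternating runs of non-digits and digits, as matched by the regex.
    return re.findall(r"(\D+|\d+)", s)

def bits_groupby(s):
    # Consecutive characters grouped by whether str.isdigit is true.
    return ["".join(g) for _, g in groupby(s, key=str.isdigit)]

# Both approaches give the same result for plain ASCII input.
for sample in ["abc123abc123", "123abc", ""]:
    assert bits_findall(sample) == bits_groupby(sample)

# '²' counts as a digit for str.isdigit but is not matched by \d.
print(bits_findall("x²"))  # ['x²']
print(bits_groupby("x²"))  # ['x', '²']
```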
user2390182
0

This comes down to how re.split() itself works, and you can't change that. When the function splits, it doesn't look at what comes after the match, so it can't decide for you whether to keep or discard the empty string left over after the final split. It simply splits there and treats the rest of the string (which may be empty) as the next piece.
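
The behavior is easy to observe directly. The short sketch below uses the pattern from the question on a few inputs; an empty string appears wherever a match touches the start or the end of the input.

```python
import re

pattern = r'(\d+)'

print(re.split(pattern, 'abc123abc'))  # ['abc', '123', 'abc']
print(re.split(pattern, 'abc123'))     # ['abc', '123', '']
print(re.split(pattern, '123abc'))     # ['', '123', 'abc']
print(re.split(pattern, '123'))        # ['', '123', '']
```

With a single capture group, the result always alternates between a piece of surrounding text and a captured match, which is why the empty pieces are needed to keep the positions consistent.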

If you don't want that empty string, you can get rid of it in various ways before collecting the results into a list. user1740577's answer is one example, but personally I prefer a list comprehension, since it's more idiomatic for simple filter/map operations:

parts = [part for part in re.split(r'(\d+)', s) if part]

I recommend against checking and getting rid of the element after the list has already been created, because it involves more operations and allocations.
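
If changing the pattern is preferred over filtering, one alternative that is not taken from the answers here is to split on the zero-width boundaries between digits and non-digits, using lookarounds. This is a sketch that assumes Python 3.7 or later, where `re.split` can split on empty matches; `split_on_boundaries` is a hypothetical name.

```python
import re

# Matches the empty position where a digit is followed by a non-digit,
# or a non-digit is followed by a digit. It never matches at the very
# start or end of the string, so no empty pieces are produced there.
BOUNDARY = re.compile(r'(?<=\d)(?=\D)|(?<=\D)(?=\d)')

def split_on_boundaries(s):
    return BOUNDARY.split(s)

print(split_on_boundaries('abc123abc123'))  # ['abc', '123', 'abc', '123']
print(split_on_boundaries('123abc'))        # ['123', 'abc']
```

One difference from the filtering approach: an empty input still yields `['']`, because `re.split` always returns at least one piece.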

theberzi