Trailing empty string after re.split()

Question

I have two strings where I want to isolate sequences of digits from everything else.

For example:

import re
s = 'abc123abc'
print(re.split(r'(\d+)', s))
s = 'abc123abc123'
print(re.split(r'(\d+)', s))

The output looks like this:

['abc', '123', 'abc']
['abc', '123', 'abc', '123', '']

Note that in the second case, there's a trailing empty string.

Obviously I can test for that and remove it if necessary but it seems cumbersome and I wondered if the RE can be improved to account for this scenario.

In this case, you can use `s.strip("1234567890")`, but what are you going to do for a more complicated regular expression? — chepner, Oct 20 '21 at 12:53
In any case, IMO this is a duplicate of [Python - re.split: extra empty strings that the beginning and end list](https://stackoverflow.com/questions/30924509/python-re-split-extra-empty-strings-that-the-beginning-and-end-list) Unless OP actually wants `['123', '123']` as a result, in which case the solution would be to use `findall`, not `split`. — mkrieger1, Oct 20 '21 at 12:54
@mkrieger1 I should have been clearer in my question. Sorry. I want the output as shown but without the trailing empty string. Use of *filter* does what I need — , Oct 20 '21 at 13:04

I'mahdi · Accepted Answer · 2021-10-20T13:02:06.637

You can use filter to drop the empty string, like this:

>>> s = 'abc123abc123'
>>> re.split(r'(\d+)', s)
['abc', '123', 'abc', '123', '']

>>> list(filter(None, re.split(r'(\d+)', s)))
['abc', '123', 'abc', '123']

Thanks to @chepner, you can also use a list comprehension:

>>> [x for x in re.split(r'(\d+)', s) if x]
['abc', '123', 'abc', '123']

The same approach works if the string contains symbols or other non-digit characters:

>>> s = '&^%123abc123$#@123'
>>> list(filter(None, re.split(r'(\d+)', s)))
['&^%', '123', 'abc', '123', '$#@', '123']
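
If this is needed in several places, the filtering step could be wrapped in a small helper so the pattern is compiled only once. This is a minimal sketch: the names `DIGITS` and `split_digits` are made up here for illustration and are not part of the answer above.

```python
import re

# Hypothetical names for illustration; the pattern is compiled once and reused.
DIGITS = re.compile(r'(\d+)')

def split_digits(s):
    """Split s into alternating non-digit and digit runs, dropping empty strings."""
    return [part for part in DIGITS.split(s) if part]

print(split_digits('abc123abc123'))        # ['abc', '123', 'abc', '123']
print(split_digits('&^%123abc123$#@123'))  # ['&^%', '123', 'abc', '123', '$#@', '123']
print(split_digits('123abc'))              # ['123', 'abc']
```

Note that filtering removes every empty string, so the leading one produced when the input starts with digits is dropped as well.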

edited Oct 20 '21 at 13:02

answered Oct 20 '21 at 12:48

You can also use a list comprehension or generator expression, with `x for x in re.split(...) if x`. – chepner Oct 20 '21 at 12:50
Just not elegant to overgenerate and then filter. Especially when the OP states that they can easily do that but don't want to. – user2390182 Oct 20 '21 at 13:03

user2390182 · Answer 2 · 2021-10-20T12:55:54.343

A simple way to use regular expressions for this would be re.findall:

import re

def bits(s):
    return re.findall(r"(\D+|\d+)", s)

bits("abc123abc123")
# ['abc', '123', 'abc', '123']

But it seems easier and more natural with itertools.groupby. After all, you are chunking an iterable based on a single condition:

from itertools import groupby

def bits(s):
    return ["".join(g) for _, g in groupby(s, key=str.isdigit)]

bits("abc123abc123")
# ['abc', '123', 'abc', '123']
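
As a quick sanity check, the two versions can be compared side by side. In this sketch they are renamed `bits_findall` and `bits_groupby` (names chosen here only so both fit in one file). They agree on ASCII input like the strings in the question, but `str.isdigit` and `\d` do not classify every Unicode character the same way: the superscript `²`, for instance, counts as a digit for `str.isdigit` but is not matched by `\d`.

```python
import re
from itertools import groupby

def bits_findall(s):
    # Alternating runs of non-digits and digits, as seen by the regex engine.
    return re.findall(r"\D+|\d+", s)

def bits_groupby(s):
    # Consecutive characters grouped by the result of str.isdigit.
    return ["".join(g) for _, g in groupby(s, key=str.isdigit)]

# Both agree on ASCII input, including the empty string.
for sample in ["abc123abc123", "123abc", ""]:
    assert bits_findall(sample) == bits_groupby(sample)

# They can differ on non-ASCII "digits" such as the superscript two.
print(bits_findall("a²3"))  # ['a²', '3']
print(bits_groupby("a²3"))  # ['a', '²3']
```

If the input may contain such characters, `str.isdecimal` is probably the closer match to `\d` as a `groupby` key.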

score 0 · Answer 3 · answered Oct 20 '21 at 12:52

This comes from how re.split() works, and you can't change it. The function returns the text between matches. When a match ends exactly at the end of the string, the text after it is the empty string, and re.split() returns that as well; it does not decide for you whether to keep or discard it.
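
A few calls illustrate when the empty strings show up: whenever a match sits at the very start or the very end of the string, the piece before or after it is empty. The inputs below are made up for illustration.

```python
import re

pattern = re.compile(r'(\d+)')

# No match touches either end of the string: no empty pieces.
print(pattern.split('abc123abc'))  # ['abc', '123', 'abc']

# Match at the end: trailing empty string.
print(pattern.split('abc123'))     # ['abc', '123', '']

# Match at the start: leading empty string.
print(pattern.split('123abc'))     # ['', '123', 'abc']

# The match covers the whole string: empty strings on both sides.
print(pattern.split('123'))        # ['', '123', '']
```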

If you don't want that empty string, you can get rid of it in various ways before collecting the results into a list. user1740577's is one example, but personally I prefer a list comprehension, since it's more idiomatic for simple filter/map operations:

parts = [part for part in re.split(r'(\d+)', s) if part]

I recommend against checking and getting rid of the element after the list has already been created, because it involves more operations and allocations.
