I have strings like this example:

"BODY: 88% RECYCLED POLYESTER, 12% ELASTANE GUSSET LINING: 91% COTTON, 9% ELASTANE EXCLUSIVE OF DECORATION"

I want to split them so that each word ending in a colon starts a new list item, and that colon word is kept in the result:

["BODY: 77% RECYCLED POLYESTER, 23% ELASTANE", "MESH: 84% POLYAMIDE, 16% ELASTANE EXCLUSIVE OF DECORATION"]

I came up with:

re.split("\s(\w+:.+)", p)

But this returns an empty string at the end, and I'm not sure why:

['BODY: 77% RECYCLED POLYESTER, 23% ELASTANE', 'MESH: 84% POLYAMIDE, 16% ELASTANE EXCLUSIVE OF DECORATION', '']
tape74
Does this answer your question? [Split by regex without resulting empty strings in Python](https://stackoverflow.com/questions/30933216/split-by-regex-without-resulting-empty-strings-in-python) – Charles Dupont May 11 '21 at 22:52

1 Answer

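You can use re.split(r"\s(?=\w+:)", s). The lookahead (?=...) ensures the split occurs only on a whitespace character that is followed by the \w+: pattern. Because a lookahead consumes no characters, the colon word is not removed by the split and stays at the start of the next item.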

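The original attempt puts everything after the whitespace inside a capturing group, which leads to undesirable results. The .+ is greedy, so the group swallows the rest of the string, and re.split includes the text of capturing groups in its output. The trailing empty string is what remains after that match, which is nothing. If you include multiple word: groups, you'll see there are bigger problems than just the trailing empty string: everything after the first split point ends up in a single item.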

Here's a comparison:

>>> import re
>>> s = "foo: bar bar baz: asdfa sdfasd quux: zzzz"
>>> #                ^                 ^
>>> # we want to split on the highlighted space characters above
>>>
>>> re.split(r"\s(\w+:.+)", s) # incorrect
['foo: bar bar', 'baz: asdfa sdfasd quux: zzzz', '']
>>> re.split(r"\s(?=\w+:)", s) # correct
['foo: bar bar', 'baz: asdfa sdfasd', 'quux: zzzz']
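
Applied to a string shaped like the one in the question, the lookahead version might look like the sketch below. The input is reconstructed from the expected output shown in the question, and the variable names p and parts are only illustrative.

```python
import re

# Input reconstructed from the question's expected output (an assumption).
p = ("BODY: 77% RECYCLED POLYESTER, 23% ELASTANE "
     "MESH: 84% POLYAMIDE, 16% ELASTANE EXCLUSIVE OF DECORATION")

# Split on a whitespace character only when a "word:" follows it.
# The lookahead consumes nothing, so each label stays with its section.
parts = re.split(r"\s(?=\w+:)", p)

for part in parts:
    print(part)
```

One caveat: \w+ does not match spaces, so a multi-word label such as GUSSET LINING: would be split just before its last word (LINING:). This sketch assumes single-word labels.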

If you want to handle splitting on multiple spaces, you can use r"\s+(?=\w+:)".
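
To see the difference, here is a small sketch on a made-up input with a run of three spaces before a label (the string and variable names are hypothetical):

```python
import re

# Hypothetical input with three spaces before "baz:".
s = "foo: bar   baz: qux"

# \s consumes only the last space of the run, so the first item
# keeps two trailing spaces.
single = re.split(r"\s(?=\w+:)", s)

# \s+ consumes the whole run of whitespace before the label.
multi = re.split(r"\s+(?=\w+:)", s)

print(single)  # ['foo: bar  ', 'baz: qux']
print(multi)   # ['foo: bar', 'baz: qux']
```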

Note also that raw strings should be used for all regex literals, so that backslash sequences reach the regex engine unchanged instead of being interpreted as Python string escapes.
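
A quick illustration of why this matters, using \b, which means "backspace" to Python and "word boundary" to the regex engine (the sample text is made up):

```python
import re

text = "a cat sat"

# In a normal string literal, "\b" is a single backspace character (0x08),
# so this pattern looks for backspace + "cat" + backspace.
plain = re.findall("\bcat\b", text)

# In a raw string, r"\b" keeps the backslash, so the regex engine
# sees a word-boundary assertion.
raw = re.findall(r"\bcat\b", text)

print(len("\b"), len(r"\b"))  # 1 2
print(plain, raw)             # [] ['cat']
```

The pattern in the question happens to work without the r prefix only because \s and \w are not Python string escapes, and newer Python versions emit a warning for such unrecognized escapes.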

ggorlen