
My question came up while I was trying to help with this post: split an enumerated text list into multiple columns

I'm looking for a regex pattern that splits this string at 1., 2. and 3., or in general at a digit (or several, if the list were longer) followed by a dot. The problem is that the string contains other numbers which need to be kept.

import re
test_string = '1. Fruit 12 oranges 2. vegetables 7 carrot 3. NFL 246 SHIRTS'

With this pattern I managed to do so, but I got an empty string at the start and didn't know how to change that.

l1 = re.split(r"\s?\d{1,2}\.", test_string)

# Output l1:
['', ' Fruit 12 oranges', ' vegetables 7 carrot', ' NFL 246 SHIRTS']

So I changed from "split it" to "search something that finds the pattern":

l2 = re.findall(r"(?:^|(?<=\d\.))([\sa-zA-Z0-9]+)(?:\d\.|$)", test_string)

# Output l2:
[' Fruit 12 oranges ', ' vegetables 7 carrot ', ' NFL 246 SHIRTS']

This is really close to what I want; the only remaining problem is the leading whitespace at the beginning of every element in the list (and the trailing whitespace on some of them).

What would be a good and efficient approach for this task? Should I stick with re.split(), or build a pattern and use re.findall()? Is my pattern reasonable as written, or is it far too complicated?

mkrieger1
Rabinzel
  • You get a leading `''` because the string starts with a delimiter (so what appears before the first one is an empty string). – Scott Hunter Mar 04 '22 at 17:27
  • I figured out why there is an empty string, I just didn't know how to avoid that :D – Rabinzel Mar 04 '22 at 17:28
  • If that's the only problem, just remove it. – Scott Hunter Mar 04 '22 at 17:29
  • True. That would be the easiest way. As you can probably see, my question isn't from a personal project where I just need a solution. I want to get a better understanding of regular expressions, and I'd like to find a way that gets this right in the first place. – Rabinzel Mar 04 '22 at 17:34
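
For reference, the "just remove it" route suggested in the comments above could look roughly like this. It is only a sketch: widening the delimiter with `\s*` on both sides and the names `parts` and `items` are assumptions for illustration, not something from the original posts.

```python
import re

test_string = '1. Fruit 12 oranges 2. vegetables 7 carrot 3. NFL 246 SHIRTS'

# Split on "<one or two digits>." and let the delimiter swallow the
# whitespace around each marker, so the pieces come out already trimmed.
parts = re.split(r"\s*\d{1,2}\.\s*", test_string)

# The string starts with a delimiter, so the first piece is ''.
# Dropping empty pieces removes it.
items = [part for part in parts if part]
print(items)
```

Numbers such as 12, 7 and 246 stay in place because they are not followed by a dot.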

1 Answer


You can get there by adding a whitespace match in two places in your expression: (?:\s) right after the first group, and \s inside the last group:

re.findall(r"(?:^|(?<=\d\.))(?:\s)([\sa-zA-Z0-9]+)(?:\s\d\.|$)", test_string)

The output is: ['Fruit 12 oranges', 'vegetables 7 carrot', 'NFL 246 SHIRTS']
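
To make the pieces of that expression easier to follow, here is the same pattern written with `re.VERBOSE` and one comment per part. This is only an annotated sketch; the names `item_pattern` and `items` are made up for illustration.

```python
import re

test_string = '1. Fruit 12 oranges 2. vegetables 7 carrot 3. NFL 246 SHIRTS'

item_pattern = re.compile(r"""
    (?:^|(?<=\d\.))     # start of the string, or right after a digit and a dot
    (?:\s)              # the whitespace after the marker, outside the capture
    ([\sa-zA-Z0-9]+)    # the item text itself (the only capturing group)
    (?:\s\d\.|$)        # whitespace, digit, dot of the next marker, or the end
""", re.VERBOSE)

items = item_pattern.findall(test_string)
print(items)
```

As written, the last group only recognizes single-digit markers, so a list that reaches 10. would need `\d+` there.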

mkrieger1
rehaqds
  • Thanks! I added and deleted `\s` in different places, but I guess I didn't try the right one :P Overall, do you agree with my approach of using lookahead and lookbehind to get everything in between? – Rabinzel Mar 04 '22 at 19:35
  • I didn't find a better approach but maybe there is! About the empty string given by split, it seems inevitable according to [this](https://stackoverflow.com/questions/30933216/split-by-regex-without-resulting-empty-strings-in-python) – rehaqds Mar 04 '22 at 20:06
  • Note that `(?:\s)` = `\s`; there is no need for the redundant non-capturing group. It is only necessary when you need to quantify a group of patterns or use alternation. – Wiktor Stribiżew Mar 04 '22 at 20:50
  • @rehaqds nice post, thanks for that. It does make a bit more sense now to me ! – Rabinzel Mar 05 '22 at 09:20
  • @WiktorStribiżew I also checked that one in my code. Thanks for pointing that out. – Rabinzel Mar 05 '22 at 09:25
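
The point about the redundant group can be checked directly. In this small sketch the variable names (`with_group`, `without_group`, `same`) are made up for illustration; the two patterns are the answer's expression with and without the extra `(?:...)` around `\s`.

```python
import re

test_string = '1. Fruit 12 oranges 2. vegetables 7 carrot 3. NFL 246 SHIRTS'

# The answer's pattern, with the non-capturing group around \s ...
with_group = r"(?:^|(?<=\d\.))(?:\s)([\sa-zA-Z0-9]+)(?:\s\d\.|$)"
# ... and the same pattern with a bare \s instead.
without_group = r"(?:^|(?<=\d\.))\s([\sa-zA-Z0-9]+)(?:\s\d\.|$)"

result_with = re.findall(with_group, test_string)
result_without = re.findall(without_group, test_string)

# Both patterns produce the same list of items for this input.
same = result_with == result_without
print(same)
```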