
The Question:

Given a list of strings, create a function that returns the same list, but with each string split along any of the following delimiters ['&', 'OR', 'AND', 'AND/OR', 'IFT'], giving a list of lists of strings.

Note that the delimiters can be mixed inside a string, there can be many adjacent delimiters, and the list is a column from a dataframe.

EX//
function(["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]

function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'had a little', 'lamb'], ['Twinkle twinkle', 'little', 'star']]

My Solution Attempt

Start by replacing every kind of delimiter with an &. I include spaces on either side so that other words like HANDY don't get affected. Next, split each string along the & delimiter, knowing that every other kind of delimiter has been replaced.

def clean_and_split(lolon):  
  # Constants
  banned_list = {' AND ', ' OR ', ' ITF ', ' AND/OR '}

  # Loop through each list of strings
  for i in range(len(lolon)):
    # Loop through each delimiter and replace it with ' & '
    for word in banned_list:
      lolon[i] = lolon[i].replace(word, ' & ')
    # Split the string along the ' & ' delimiter
    lolon[i] = lolon[i].split('&')
  return lolon

The problem is that side-by-side delimiters often get replaced in a way that leaves an empty string in the middle. Also, certain combinations of delimiters don't get removed. This is because when the 'replace' method reads ' OR OR OR ', it replaces the first ' OR ' (since it matches) but won't replace the second, because the space before it has already been consumed and it reads it as 'OR '.

EX//
clean_and_split(["Mario AND Luigi AND & Peach"])
>> [['Mario ', ' Luigi ', ' ', ' Peach']]

clean_and_split(["Mario OR OR OR Luigi", "Testing AND AND PlsWork "])
>> [['Mario ', ' OR ', ' Luigi'], ['Testing ', ' AND PlsWork ']]

The workaround is to list every delimiter twice, e.g. banned_list = [' AND ', ' OR ', ' ITF ', ' AND/OR ', ' AND ', ' OR ', ' ITF ', ' AND/OR '], forcing the code to loop through everything twice. (This has to be a list; a set literal would silently discard the duplicates.)
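The missed delimiter comes from how `str.replace` scans: matches cannot overlap, so the trailing space of one match is no longer available as the leading space of the next. A minimal sketch of that behaviour follows; `replace_until_stable` is a hypothetical helper written for illustration, not something from the code above.

```python
def replace_until_stable(text, old, new):
    """Repeat str.replace until a pass changes nothing (hypothetical helper).

    Assumes `new` does not contain `old`, otherwise the loop never ends.
    """
    while True:
        replaced = text.replace(old, new)
        if replaced == text:
            return replaced
        text = replaced


# One pass: the first match consumes the space that the middle 'OR'
# would need as its leading space, so the middle 'OR' is skipped.
single_pass = ' OR OR OR '.replace(' OR ', ' & ')

# Repeating the pass picks up what the previous pass left behind.
stable = replace_until_stable(' OR OR OR ', ' OR ', ' & ')

# A set literal keeps only one copy of each delimiter, so writing the
# delimiters twice inside {...} does not cause a second pass.
deduplicated = {' AND ', ' OR ', ' AND ', ' OR '}

print(repr(single_pass))
print(repr(stable))
print(len(deduplicated))
```

Looping until nothing changes avoids guessing how many passes are needed, but the empty pieces between adjacent `&` characters still have to be filtered out after splitting.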

Alternate Solution?

Split the column along a list of delimiters. The problem with this is that back-to-back delimiters don't get caught.

    df['Correct_Column'].str.split('(?: AND | IFT | OR | & )')

EX//
function(["Mary & AND had a little OR IFT lamb", "Twinkle twinkle AND & ITF little OR & star"])
>> [['Mary', 'AND had a little', 'IFT lamb'], ['Twinkle twinkle', '& little', '& star']]

There HAS to be a more elegant way!

  • can you not just do something like: `import re; [re.split('(?:AND|IFT|OR|&)', x) for x in ["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star"]]`? – Shabble Jul 03 '22 at 21:08
  • to follow up on myself, no: it needs the word-boundary match with `\b` to avoid the HANDY problem you mention: `[re.split('\b(?:AND|IFT|OR|&)\b', x) for x in ["Mary & had a little AND lamb", "Twinkle twinkle ITF little OR star not HANDY though"]]` – Shabble Jul 03 '22 at 21:10
  • When I run your code I only get a split on the comma: [['Mary & had a little AND lamb'], ['Twinkle twinkle ITF little OR star not HANDY though']] – Pierre Olivier Jul 03 '22 at 23:00
  • re.split('(?: AND | IFT | OR | & )', "Mary AND AND AND had a little IFT & OR lamb") works better but still does not work with two side by side delimiters that are exactly the same – Pierre Olivier Jul 03 '22 at 23:09
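A likely reason the `\b` pattern in the comments returned each string unsplit: in an ordinary Python string literal, `'\b'` is a backspace character, so the regex engine never sees a word-boundary assertion. A raw string (`r'\b'`) avoids that. Separately, `\b` cannot sit on both sides of `&` when it is surrounded by spaces, because there is no word character next to it. The sketch below assumes both spellings `ITF` and `IFT` should count as delimiters, since the question uses both; `SPLITTER` is a name chosen for illustration.

```python
import re

# In a normal string literal '\b' is the backspace character (0x08);
# only the raw string carries a backslash followed by 'b' to the regex engine.
backspace_literal = '\b'
raw_literal = r'\b'

# Word boundaries guard the alphabetic delimiters; '&' is matched on its own
# because '\b&\b' would require a word character on each side of the '&'.
SPLITTER = re.compile(r'\b(?:AND|ITF|IFT|OR)\b|&')

samples = [
    "Mary & had a little AND lamb",
    "Twinkle twinkle ITF little OR star not HANDY though",
]
parts = [SPLITTER.split(s) for s in samples]
print(parts)
```

The pieces still carry their surrounding spaces, and adjacent delimiters would still produce whitespace-only pieces, so a strip-and-filter step is needed afterwards.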

1 Answer


This is where a lookahead and lookbehind are useful, as they won't eat up the spaces you use to match correctly:

import re

text = 'Mary & had a little AND OR lamb, white as ITF snow OR'

replaced = re.sub(r'(?<=\s)&|OR|AND|ITF|AND/OR(?=\s)', '&', text)
parts = [stripped for s in replaced.split('&') if (stripped := s.strip())]
print(parts)

Result:

['Mary', 'had a little', 'lamb, white as', 'snow']

However, note that:

  • the parts = line may solve most of your problems anyway, using your own method;
  • a lookbehind (unlike a lookahead) requires a fixed-width pattern in Python's re module, so something like (?<=\s|^) won't work; also, the OR at the end of the text causes an empty string to be found at the end;
  • the lookahead/lookbehind correctly deals with 'AND OR', but still finds an empty string in between, which is removed on the parts = line;
  • the walrus operator (Python 3.8+) is in the parts = line as a simple way to filter out empty strings; stripped := s.strip() is falsy if the result is an empty string, so stripped only shows up in the list if it is not an empty string.
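One possible variation on the regex above, as a minimal sketch rather than a definitive solution. In a regex, `|` binds more loosely than the lookarounds, so the alternatives need to be grouped for the lookarounds to apply to every delimiter. The sketch uses the negative forms `(?<!\S)` and `(?!\S)`, which are fixed-width and also succeed at the start and end of the string, and it swallows a whole run of adjacent delimiters in one match. The names `DELIMITERS`, `DELIMITER_RUN` and `split_on_delimiters` are illustrative, and including both `ITF` and `IFT` is an assumption based on the question using both spellings.

```python
import re

# Assumed delimiter set; both 'ITF' and 'IFT' appear in the question.
DELIMITERS = ['AND/OR', 'AND', 'OR', 'ITF', 'IFT', '&']
_ALTERNATION = '|'.join(re.escape(d) for d in DELIMITERS)

# A delimiter must be a whole whitespace-separated token:
#   (?<!\S)  no non-space character directly before it (or start of string)
#   (?!\S)   no non-space character directly after it (or end of string)
# The outer (?:...)+ consumes a run of adjacent delimiters as one separator.
DELIMITER_RUN = re.compile(rf'(?:(?<!\S)(?:{_ALTERNATION})(?!\S)\s*)+')


def split_on_delimiters(strings):
    """Split each string on runs of delimiters, dropping empty pieces."""
    result = []
    for text in strings:
        stripped = (piece.strip() for piece in DELIMITER_RUN.split(text))
        result.append([piece for piece in stripped if piece])
    return result


print(split_on_delimiters([
    "Mary & AND had a little OR IFT lamb",
    "Twinkle twinkle AND & ITF little OR & star",
]))
```

This version builds a new list instead of modifying its input and does not need the walrus operator. For a dataframe column, the values could be passed in as a plain list, e.g. `split_on_delimiters(df['Correct_Column'].tolist())`.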
Grismar
  • When I run this code it works fine for 'AND's that are inside words, but for some reason it does not catch the 'OR's Ex// ["Martha & AND GEORGE HANDY & Ogracy Stewart"] >> ['Martha', 'GE', 'GE HANDY', 'Ogracy Stewart'] Also I get an error for this "[stripped for s in replaced.split('..." Saying name 'stripped' is not defined – Pierre Olivier Jul 04 '22 at 15:28
  • The latter means you're probably running an older version of Python; you should mention the version of Python you use in your question if it's not the latest. As for the `OR` in `GEORGE`, I would expect you would want that to be captured, otherwise what's the point - but if you don't, you can simply `sub('&|OR|AND|ITF|AND/OR', '&', text)`. You can replace `stripped` and `stripped := s.strip()` both with `s.strip()` to avoid using the walrus operator for older versions of Python. – Grismar Jul 05 '22 at 01:00