I want to split an input string into chunks. A split should occur whenever we hit a word in city_list (e.g., city_list = ['Berlin']), and the chunk that starts there should include the next four words (spaces and special characters don't count toward the four but must be kept). The important constraints are: words in city_list must always be at the start of a chunk; splits must only be made on spaces, never within words; all formatting must be preserved in the output (all spaces, punctuation, special characters, \r\n, etc.); and chunks containing a string from city_list must have at least 4 words (other chunks can have fewer if necessary).
For instance:
# Input string
test2 = ' Department of Medical Affairs\r\n North Louisiana Health Care System\r\n 500 Lancaster Rd \r\n Berlin, TX 7526\r\n'
# Desired output
output2 = [
    ' ',
    'Department of Medical Affairs\r\n',
    ' ',
    'North Louisiana Health Care System\r\n',
    ' ',
    '500 Lancaster Rd \r\n',
    ' ',
    'Berlin, TX 7526\r\n'
]
# Output of ''.join(output2): ' Department of Medical Affairs\r\n North Louisiana Health Care System\r\n 500 Lancaster Rd \r\n Berlin, TX 7526\r\n'
# So test2 == ''.join(output2) yields True
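To make the "all formatting must be preserved" requirement concrete, here is a small check of the round-trip property. It is only an illustration: the pattern `\S+|\s+` is one possible way to tokenize, not something the rest of this post depends on.

```python
import re

test2 = ' Department of Medical Affairs\r\n North Louisiana Health Care System\r\n 500 Lancaster Rd \r\n Berlin, TX 7526\r\n'

# str.split() throws away every whitespace run (including \r\n),
# so the original string cannot be rebuilt from its result.
words = test2.split()
assert ' '.join(words) != test2

# Alternating runs of non-whitespace and whitespace cover every character,
# so joining the tokens gives back the input exactly.
tokens = re.findall(r'\S+|\s+', test2)
assert ''.join(tokens) == test2
```

Any chunking that only ever concatenates neighbouring tokens like these keeps `''.join(chunks) == text` true by construction.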
The point of all this is to be able to look ahead after the city (e.g., Berlin) to see whether a state or other location indicator appears within a few words of it, and then redact the city name (I already have a function that deals with that). I can't search the entire string for "Berlin" at once because I need to deal with each occurrence of that word on a case-by-case basis (e.g., there could be a city named "Cornelius" and a person named "Cornelius", so chunking the string around the city keyword would help provide context for each case).
Here's an attempt; I keep getting an index error when I test it out. I understand why but don't know how to fix it.
def split_by_city(text, city_list):
    result = []
    words = text.split() # split the text into a list of words
    i = 0 # initialize a counter for iterating over the words
    while i < len(words): # loop over the words
        if words[i] in city_list: # if the word is in the city list
            j = i+1 # start a new counter to iterate over the next 4 words
            while j < len(words) and j-i < 5: # loop over the next 4 words or until the end of the text
                if words[j][-1] not in ['\r', '\n']: # exclude new line characters at the end of the word
                    if result and result[-1][-1] not in ['\r', '\n']: # add space between words only if previous word doesn't end with newline character
                        result[-1] += " " # add a space to the last chunk
                    result[-1] += words[j] # add the current word to the last chunk
                j += 1 # increment the counter for iterating over the next 4 words
            if result and result[-1][-1] not in ['\r', '\n']: # add a space to the last chunk if it doesn't end with newline character
                result[-1] += " "
        else: # if the word is not in the city list
            if not result or result[-1][-1] in ['\r', '\n']: # start a new chunk if it's the first word or the previous word ends with newline character
                result.append("")
            if result and result[-1][-1] not in ['\r', '\n']: # add a space to the last chunk if it doesn't end with newline character
                result[-1] += " "
            result[-1] += words[i] # add the current word to the last chunk
        i += 1 # increment the counter for iterating over the words
    return result # return the list of chunks
split_by_city(test2, ['Berlin'])
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-111-4ee705e7672a> in <module>
----> 1 split_by_city4(test2, ['Berlin'])
<ipython-input-107-e4e0eb1457fc> in split_by_city4(text, city_list)
17 if not result or result[-1][-1] in ['\r', '\n']: # start a new chunk if it's the first word or the previous word ends with newline character
18 result.append("")
---> 19 if result and result[-1][-1] not in ['\r', '\n']: # add a space to the last chunk if it doesn't end with newline character
20 result[-1] += " "
21 result[-1] += words[i] # add the current word to the last chunk
IndexError: string index out of range
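For context on the traceback: the failing check runs right after `result.append("")`, so `result[-1]` is the empty string and `result[-1][-1]` has nothing to index. Separately, `text.split()` discards all whitespace, so no word can ever end in '\r' or '\n' and the original formatting is already gone by that point.

Below is a rough sketch of one direction that avoids both problems by tokenizing with a regex instead of `split()`. It is not a definitive implementation, and it rests on several assumptions: the names `TOKEN_RE`, `lookahead`, `in_city`, and `owed` are made up for illustration; cities are single words, matched after stripping `string.punctuation`; non-city chunks end at a newline and whitespace at the start of a line becomes its own chunk (to match output2); and a second city inside the look-ahead window starts a new chunk, which can leave the earlier city chunk short.

```python
import re
import string

# A word, OR horizontal whitespace ending in a newline, OR horizontal whitespace.
# Every character falls into exactly one token, so nothing is lost.
TOKEN_RE = re.compile(r'\S+|[^\S\n]*\n|[^\S\n]+')


def split_by_city(text, city_list, lookahead=4):
    cities = set(city_list)
    chunks = []
    current = ""      # chunk currently being built
    in_city = False   # True while `current` starts with a city word
    owed = 0          # words still to be added after the city word

    def flush():
        nonlocal current, in_city, owed
        if current:
            chunks.append(current)
        current, in_city, owed = "", False, 0

    for tok in TOKEN_RE.findall(text):
        if tok.isspace():
            if not current:
                chunks.append(tok)  # whitespace before any word: own chunk
                continue
            current += tok          # trailing whitespace stays with its chunk
            if tok.endswith("\n") and owed == 0:
                flush()             # a newline closes a chunk that owes no words
            continue

        # Assumption: a city is one word, compared without surrounding punctuation.
        is_city = tok.strip(string.punctuation) in cities
        if is_city or (in_city and owed == 0):
            flush()                 # city words always start a chunk
        if is_city:
            in_city, owed = True, lookahead
        elif owed:
            owed -= 1
        current += tok

    flush()
    return chunks


test2 = ' Department of Medical Affairs\r\n North Louisiana Health Care System\r\n 500 Lancaster Rd \r\n Berlin, TX 7526\r\n'
chunks = split_by_city(test2, ['Berlin'])
assert ''.join(chunks) == test2
```

Because chunks are built only by concatenating consecutive tokens, the join check holds for any input. Whether a newline should end a non-city chunk, and how to treat overlapping city windows, are choices that the stated requirements leave open.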