How to split string into chunks at keyword(s), while preserving spacing and conditions of what words to split at?

Question

I want to take an input string and split it into chunks. A split should occur whenever we hit a word in city_list (e.g. city_list = ['Berlin']), and the resulting chunk should include the next four words (spaces and special characters don't contribute to the count but should be included). The important constraints are: words in city_list must always be at the start of a chunk; splits must only be made on spaces and not within words; all formatting must be preserved in the output (all spaces, punctuation, special characters, \r\n, etc.); and chunks that contain a string from city_list must have at least 4 words (other chunks can have fewer words if necessary).

For instance:

# Input string
test2 = '                                            Department of Medical Affairs\r\n                                            North Louisiana Health Care System\r\n                                            500 Lancaster Rd \r\n                                            Berlin, TX 7526\r\n'

# Desired output
output2 = [
  '                                            ',
  'Department of Medical Affairs\r\n',
  '                                            ',
  'North Louisiana Health Care System\r\n',
  '                                            ',
  '500 Lancaster Rd \r\n',
  '                                            ',
  'Berlin, TX 7526\r\n'
]

# Output of ''.join(output2): '                                            Department of Medical Affairs\r\n                                            North Louisiana Health Care System\r\n                                            500 Lancaster Rd \r\n                                            Berlin, TX 7526\r\n'

# So test2 == ''.join(output2) yields True

The point of all this is to be able to look ahead after the city (e.g., Berlin) to see whether a state or other location indicator appears within a few words of it, and then redact the city name (I already have a function that deals with that). I can't search the entire string for "Berlin" at once because I need to deal with each occurrence of that word on a case-by-case basis (e.g., there could be a city named "Cornelius" and a person named "Cornelius", so chunking the string around the city keyword would help provide context for each case).

Here's an attempt; I keep getting an index error when I test it out. I understand why but don't know how to fix it.

def split_by_city(text, city_list):
    result = []
    words = text.split() # split the text into a list of words
    i = 0 # initialize a counter for iterating over the words
    while i < len(words): # loop over the words
        if words[i] in city_list: # if the word is in the city list
            j = i+1 # start a new counter to iterate over the next 4 words
            while j < len(words) and j-i < 5: # loop over the next 4 words or until the end of the text
                if words[j][-1] not in ['\r', '\n']: # exclude new line characters at the end of the word
                    if result and result[-1][-1] not in ['\r', '\n']: # add space between words only if previous word doesn't end with newline character
                        result[-1] += " " # add a space to the last chunk
                    result[-1] += words[j] # add the current word to the last chunk
                j += 1 # increment the counter for iterating over the next 4 words
            if result and result[-1][-1] not in ['\r', '\n']: # add a space to the last chunk if it doesn't end with newline character
                result[-1] += " "
        else: # if the word is not in the city list
            if not result or result[-1][-1] in ['\r', '\n']: # start a new chunk if it's the first word or the previous word ends with newline character
                result.append("")
            if result and result[-1][-1] not in ['\r', '\n']: # add a space to the last chunk if it doesn't end with newline character
                result[-1] += " "
            result[-1] += words[i] # add the current word to the last chunk
        i += 1 # increment the counter for iterating over the words
    return result # return the list of chunks

split_by_city(test2, ['Berlin'])


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-111-4ee705e7672a> in <module>
----> 1 split_by_city4(test2, ['Berlin'])

<ipython-input-107-e4e0eb1457fc> in split_by_city4(text, city_list)
     17             if not result or result[-1][-1] in ['\r', '\n']: # start a new chunk if it's the first word or the previous word ends with newline character
     18                 result.append("")
---> 19             if result and result[-1][-1] not in ['\r', '\n']: # add a space to the last chunk if it doesn't end with newline character
     20                 result[-1] += " "
     21             result[-1] += words[i] # add the current word to the last chunk

IndexError: string index out of range
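
The failing expression can be reproduced in isolation. This is a minimal sketch, assuming the only problem is that result[-1] is the empty string just added by result.append(""); ends_with_newline is a hypothetical helper name, not part of the code above:

```python
result = []
result.append("")          # the new chunk starts out as an empty string

try:
    result[-1][-1]         # the last character of "" does not exist
except IndexError as exc:
    error_message = str(exc)   # same message as in the traceback


def ends_with_newline(chunk):
    """Return True if chunk ends with CR or LF; safe for empty strings."""
    return chunk.endswith(("\r", "\n"))


safe_on_empty = ends_with_newline("")          # no exception is raised
safe_on_line = ends_with_newline("Rd \r\n")
```

str.endswith accepts a tuple of suffixes and simply returns False for an empty string, so no separate emptiness check is needed before it.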

"How to split string into chunks of length n" - I can't understand how this part of the title relates to the question. In the example input and output, what is the value of `n`? What should happen, for example, if there is a different number of leading spaces? How does that change relate to the value of `n`? — Karl Knechtel, Mar 21 '23 at 22:50
"Here's an attempt; I keep getting an index error when I test it out. I understand why but don't know how to fix it." Well, what is your understanding of why? What is preventing you from fixing it, given that understanding? — Karl Knechtel, Mar 21 '23 at 22:51
(Hint: suppose that, when the code `if not result or result[-1][-1] in ['\r', '\n']:` is reached, `result` is an empty list. Would the condition be met? Would the code run? What would `result` be after that? Therefore, what would `result[-1][-1]` try to do? **Should the code** `if result and result[-1][-1] not in ['\r', '\n']:` be tried in this situation? Is it? Why? (Hint for the hint: are you familiar with `elif`?)) — Karl Knechtel, Mar 21 '23 at 22:53
It is a bit hard to understand both the intention of your code, and your high-level goal by doing this. If you are just looking for context when you find a city name, why don't you just capture the index `i` of when that happens, and you then retrieve the slice `i : i+4`? — Rodrigo Rodrigues, Mar 21 '23 at 23:23
_Tomorrow I'll take the bus to Cornelius. My friend will be waiting for me at the station._ A chunk of `[Cornelius + next four words]` won't probably work here. You may want to try other tools. See [this](https://stackoverflow.com/a/36255377/15032126) for example. — Ignatius Reilly, Mar 21 '23 at 23:46
@RodrigoRodrigues -- i:i+4 is my current method! I've been looking for a workaround since my data is unstructured & sometimes I'll run into text like "The city is Lincoln \r\r\r \n\n , Nebraska" where using i:i+4 is tricky because of the strange spacing. I also tried this regex: r"(?<=\b" + city + r")\s?(,|\.|\s){1,100}(\w{1,100})?\s?[,\.\-]?(\s{1,100})?(\w{1,100})?(\s{1,100})?(\w{1,100})?" And it's working ok, but it's so granular that I worried that it would miss edge cases. That's why I was considering chunking instead, but it might not be much better than i:i+4. I'm open to other options! — skaleidoscope, Mar 22 '23 at 00:07
@KarlKnechtel that makes sense - even when I initialize the list, I get the same error. I'm a bit deep into the code so losing perspective on what might be going wrong. Basically just want to turn the string into chunks with the first word being the city, with enough window to see if there's a state shortly after it even if there are lots of newlines and returns in between ("Berlin \r\r\r\r TX"). — skaleidoscope, Mar 22 '23 at 01:07
"I'm a bit deep into the code so losing perspective on what might be going wrong." - **Please** read [mre] and https://ericlippert.com/2014/03/05/how-to-debug-small-programs/. It is your responsibility, before posting, to try to locate a specific problem and know exactly what goes wrong first, at what step, which means that first you must have a clear expectation of what the program is supposed to do, step by step. Generally if you are struggling with getting code like this to work then **write much less code at a time** and **make sure each part works** before moving on. — Karl Knechtel, Mar 22 '23 at 09:03
The problem description still makes no sense to me at all. "The splits should occur when we hit a word in city_list (eg. city_list = ['Berlin'])" - but the "desired output" shows splits at other points, too. "and include the next four words (spaces and special characters don't contribute to the count but should be included)." - but `'North Louisiana Health Care System\r\n'` is five words, and `'500 Lancaster Rd \r\n'` is three words. Aside from that, the desired output doesn't look like it has **anything to do with** the words at all; it's only separating runs of spaces from the rest. — Karl Knechtel, Mar 22 '23 at 09:07
For example, if I simply wrote code that splits the text into lines, and then splits each line into the leading spaces and then everything after those spaces, I would get the apparently desired output for this example, *without ever even having to think about the contents of `city_list`.* — Karl Knechtel, Mar 22 '23 at 09:09
@KarlKnechtel I see what you mean, sorry for being unclear - please see my responses to VPK's answer for more. Yes, you're right that the split at Berlin is the most important part, and should have a window of 4 words after it (newlines/special chars would be included, but not counted towards the count of 4). The rest of the string would be split as well as a result, and I'm less concerned about length of chunks that don't have a city in them, as long as they don't omit chars. Eg, "The city is Berlin, \n is a nice place and blah" -> ["The city is ", "Berlin, \n is a nice place and", " blah"] — skaleidoscope, Mar 22 '23 at 15:11
"Eg, "The city is Berlin, \n is a nice place and blah" -> ["The city is ", "Berlin, \n is a nice place and", " blah"]" - in this example, how did you decide to split after `and`? — Karl Knechtel, Mar 22 '23 at 20:41

score 0 · Answer 1 · answered Mar 22 '23 at 00:22

I took a look at your code. The error comes from result[-1][-1], which indexes result as if it were a 2D array: right after result.append("") the last element is an empty string, so asking for its last character raises the IndexError shown in your traceback.

Here is a solution to what I think you are trying to go for:

def splitByCity(text,citylist):
    # split text into words
    wordlist = text.split()
    result = []
    # variable for keeping track of last chunk and a counter
    currentChunk = ""
    c = 0
    inWordList = True
    dontPrint = False
    while inWordList == True:
        for city in citylist:
            if (wordlist[c] == city):
                result.append(currentChunk)
                # Get next 4 words ONLY if there are words to spare
                if (len(wordlist) - c - 1) > 4:
                    currentChunk = ''
                    # Change below number from 4 to higher number to increase chunk length of found instances
                    for i in range(4):
                        currentChunk += wordlist[c + i] + " "
                    result.append(currentChunk)
                    c = c + 4
                    currentChunk = ''
                else:
                    dontPrint = True
                    currentChunk = wordlist[c] + ' '
                
        if (dontPrint == False):
            currentChunk += (wordlist[c] + ' ')
        else:
            dontPrint = False
        c = c + 1
        if (len(wordlist) - c) < 2:
            inWordList = False

    currentChunk += wordlist[c]
    result.append(currentChunk)
    
    return result

Here is a sample call with two cities:

text = "Hi I am a highly suspicious official that has come in from Berlin to experience the harsh winters here, not in Moscow though I am a"
citylist = ["Berlin","Moscow"]
print(splitByCity(text,citylist))

With the output:

['Hi I am a highly suspicious official that has come in from ', 'Berlin to experience the ', 'harsh winters here, not in ', 'Moscow though I am a']

There may be slight discrepancies for your edge cases, but the general approach should carry over. Note that text.split() discards the original whitespace, including \r and \n, so the chunks above are rebuilt with single spaces rather than the exact spacing of the input.
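
One way to keep every character is to tokenize with a regular expression instead of text.split(). The following is a minimal sketch, not a definitive implementation. It assumes that a word is any run of non-whitespace characters, that a word counts as a city if it matches an entry of the city list once surrounding punctuation is stripped, and that a city chunk is closed right after its last counted word. The name split_by_city_keep_ws and the window parameter are illustrative choices.

```python
import re
import string


def split_by_city_keep_ws(text, city_list, window=4):
    """Split text so that every city word starts a chunk.

    A city chunk holds the city word plus up to `window` following words,
    including all whitespace in between. Joining the chunks gives back the
    input unchanged.
    """
    cities = set(city_list)
    # Alternating runs of whitespace and non-whitespace; nothing is dropped.
    tokens = re.findall(r"\s+|\S+", text)

    chunks = []
    current = ""
    words_left = None  # None: not in a city chunk; int: words still to take

    def flush():
        nonlocal current
        if current:
            chunks.append(current)
            current = ""

    for token in tokens:
        if token.isspace():
            current += token          # whitespace never counts as a word
            continue
        if token.strip(string.punctuation) in cities:
            flush()                   # a city word always opens a new chunk
            current = token
            words_left = window
        else:
            current += token
            if words_left is not None:
                words_left -= 1
        if words_left == 0:           # window is full: close the city chunk
            flush()
            words_left = None
    flush()
    return chunks


sample = "The city is Berlin, \n is a nice place and blah"
chunks = split_by_city_keep_ws(sample, ["Berlin"])
# chunks == ['The city is ', 'Berlin, \n is a nice place', ' and blah']
# ''.join(chunks) == sample
```

On the test2 string from the question this yields two chunks, everything before "Berlin," and 'Berlin, TX 7526\r\n', rather than the eight chunks of output2. If a second city appears inside a window, this sketch closes the first chunk early and starts a new one at the second city.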

thank you! I can definitely build off of this. When I try to run it on the test2 sample string in my post, I get: ['Department of Medical Affairs North Louisiana Health Care System 500 Lancaster Rd Berlin, TX 7526'] (rather than what's in output2). — skaleidoscope, Mar 22 '23 at 01:04
If you provide your working specifications a little more concisely/clearly I can adjust the code accordingly. You have not specified the city list in the example that you currently have. If Berlin is the only component of that city list I cannot understand why there are more than two chunks for your example. — VPK, Mar 22 '23 at 01:43
Sorry, I can see how that was unclear. You're right that in my example, there wouldn't appear to be a need to have more than two chunks- one for everything before "Berlin" and one for everything after Berlin, inclusive. My cutoff of 4 words after the city is because I figured that the wider the window, the more scope there would be for mistakes to be made in location tagging since we'd be getting farther away from the original city (so I'd prefer keeping the window small if possible). — skaleidoscope, Mar 22 '23 at 02:52
Splitting at Moscow does make sense though and I can build off of that if the window option doesn't work. I would just need all special characters, spacing, etc. preserved exactly into the chunks, so that when I join the chunks at the end it's identical to the input string. So \n, \r, and ' ' would need to be saved literally. In the case where there is a bunch of spacing after a city (eg "Berlin \r\r\n\n TX is a city"), we'd need to chunk Berlin along with the spacing and the next few words just like that, because \r\n aren't useful - we'd need to access TX to see if it's a state — skaleidoscope, Mar 22 '23 at 02:58
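
The "Berlin \r\r\n\n TX" case can also be handled with a single regular expression and re.split. This is a rough sketch under stated assumptions: a word is a run of \w characters (so a hyphenated or apostrophized word counts as more than one word), anything else between words is treated as a separator, and the names chunk_with_regex and window are hypothetical.

```python
import re


def chunk_with_regex(text, city_list, window=4):
    """Split text around each city plus up to `window` following words."""
    if not city_list:
        return [text] if text else []
    cities = "|".join(re.escape(city) for city in city_list)
    # One capturing group around the whole chunk so that re.split keeps it.
    # \W+ absorbs any mix of spaces, CR, LF and punctuation between words.
    pattern = re.compile(r"(\b(?:%s)\b(?:\W+\w+){0,%d})" % (cities, window))
    # re.split returns the text between matches and the matches themselves;
    # empty strings at the edges are filtered out.
    return [piece for piece in pattern.split(text) if piece]


sample = "I flew to Berlin \r\r\n\n TX is a city in Texas"
pieces = chunk_with_regex(sample, ["Berlin"])
# pieces == ['I flew to ', 'Berlin \r\r\n\n TX is a city', ' in Texas']
# ''.join(pieces) == sample
```

Because the matches do not overlap, a second city that falls inside the window of the first stays inside the first chunk instead of starting its own.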
