5

I have a large dataset all_transcripts with almost 3 million rows. One of the columns msgText contains written messages.

>>> all_transcripts['msgText']

['this is my first message']
['second message is here']
['this is my third message']

Furthermore, I have a list with 200+ words, called gemeentes.

>>> gemeentes
['first','second','third' ... ]

If a word in this list is contained in msgText, I want to replace it by another word. To do so, I created the function:

def replaceCity(text):
    newText = text.replace(plaatsnaam, 'woonplaats')
    return str(newText)

So, my desired output would look like:

['this is my woonplaats message']
['woonplaats message is here']
['this is my woonplaats message']

Currently, I am looping through the list and for every item in my list, apply the replaceCityfunction.

for plaatsnaam in gemeentes:
    global(plaatsnaam)
    all_transcripts['filtered_text'] = test.msgText.apply(replaceCity)

However, this takes very long, so does not seem to be efficient. Is there a faster way to perform this task?


This post (Algorithm to find multiple string matches) is similar, however my problem is different because:

  • here there is only one big piece small of text, while I have a dataset with many different rows

  • I want to replace words, rather than just finding the words.

Emil
  • 1,531
  • 3
  • 22
  • 47

1 Answers1

5

Assuming all_transcripts is a pandas DataFrame:

all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')

Example:

all_transcripts = pd.DataFrame([['this is my first message'],
                                ['second message is here'],
                                ['this is my third message']],
                               columns=['msgText'])
gemeentes = ['first','second','third']

all_transcripts['msgText'].str.replace('|'.join(gemeentes),'woonplaats')

outputs

0    this is my woonplaats message
1       woonplaats message is here
2    this is my woonplaats message
Dan
  • 45,079
  • 17
  • 88
  • 157
  • 1
    Thanks! With this code it replaces each occurence of the word, even if it is contained in another word. Do you know if there is a way to do it for whole words only? E.g. if I had the sentences 'This is the firsttime for me' and 'What about a second time', It replaces both sentences: 'This is the woonplaatstime for me' and 'What about a woonplaats time'. However, I want to replace only the instances where the words are replaced that fully contain the word. So that the output would be: 'This is the firsstime for me' and 'What about a woonplaats time'. Do you know how to do this? – Emil May 01 '19 at 10:19
  • 2
    I have added spaces around my words in `gemeentes`, so now it works! – Emil May 01 '19 at 10:23
  • 2
    @emil pandas `.str.replace` accepts regex, so there might be a neater way to find a word. I would suggest maybe leaving `gemeentes` as it was but changing the join from `'|'.join(gemeentes)` to `pattern = r'\s*' + r'\s*|\s*'.join(gemeentes) + r'\s*' and then `.str.replace(pattern, 'woonplaats'). This way you'll hit all white space, not just single spaces. You'll probably also want to account for words ending in full stops, quesstion marks etc. But I'm sure if you search for identifying whole words in regex you'll find a good solution. – Dan May 01 '19 at 10:51
  • `.str.replace(pattern, 'woonplaats')` does not work. It still replaces parts of words and it now also replaces the spaces around the words. Shouldn't you use the regex word boundary? – Superdooperhero Jul 20 '19 at 22:39
  • `pattern = r'\b' + '|'.join(gemeentes) + r'\b'` worked for me. – Superdooperhero Jul 20 '19 at 22:49