I have a list that contains many tagged bigrams. Some of the bigrams are tagged incorrectly, so I want to remove them from the master list. One of the words in each such bigram recurs frequently, so I can remove a bigram whenever it contains one of those unwanted words. A pseudo-example is below:

master_list = ['this is', 'is a', 'a sample', 'sample word', 'sample text', 'this book', 'a car', 'literary text', 'new book', 'them about', 'on the', 'in that', 'tagged corpus', 'on top', 'a car', 'an orange', 'the book', 'them what', 'then how']

unwanted_words = ['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them']

new_list = [item for item in master_list if not [x for x in unwanted_words] in item]

I can remove the items one word at a time, i.e. each time I create a new list without the items that contain a given word, say, 'on'. This is tedious: it would take hours of filtering and creating a new list for each unwanted word. I thought a loop would help. However, I get the following TypeError:

Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    new_list = [item for item in master_list if not [x for x in unwanted_words] in item]
  File "<pyshell#21>", line 1, in <listcomp>
    new_list = [item for item in master_list if not [x for x in unwanted_words] in item]
TypeError: 'in <string>' requires string as left operand, not list

Your help is highly appreciated!

Mohammed

1 Answer

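Your condition if not [x for x in unwanted_words] in item is the same as if not unwanted_words in item: the inner comprehension just builds a copy of the list, so you are checking whether a list is contained in a string. The in operator on a string only accepts a string as its left operand, hence the TypeError.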
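To make the cause of the error concrete, here is a minimal sketch that reproduces it with a shortened word list (the names copied and raised are only illustrative and do not appear in the question):

```python
unwanted_words = ['this', 'is', 'a']
item = 'this is'

# The inner comprehension performs no per-word test; it only copies the list.
copied = [x for x in unwanted_words]

# Testing a list for membership in a string is not supported,
# which is exactly what the traceback reports.
try:
    copied in item
    raised = False
except TypeError:
    raised = True
```

After running this, copied is equal to unwanted_words and raised is True.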

Instead, you can use any to check whether any word of the bigram is in unwanted_words. Also, you could make unwanted_words a set, so that each membership test takes constant time on average instead of scanning the whole list.

>>> master_list = ['this is', 'is a', 'a sample', 'sample word', 'sample text', 'this book', 'a car', 'literary text', 'new book', 'them about', 'on the', 'in that', 'tagged corpus', 'on top', 'a car', 'an orange', 'the book', 'them what', 'then how']
>>> unwanted_words = {'this', 'is', 'a', 'on', 'in', 'an', 'the', 'them'}
>>> [item for item in master_list if not any(x in unwanted_words for x in item.split())]
['sample word', 'sample text', 'literary text', 'new book', 'tagged corpus', 'then how']
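If you need this filter in several places, the same idea could be wrapped in a small helper. The following is only a sketch: drop_bigrams is a hypothetical name, and it assumes the two words of each bigram are separated by whitespace. It uses frozenset.isdisjoint, which is equivalent to the not any(...) test above.

```python
def drop_bigrams(bigrams, unwanted):
    """Return the bigrams in which no whitespace-separated token is unwanted."""
    unwanted = frozenset(unwanted)  # hash-based lookup instead of a list scan
    # isdisjoint is True when the bigram shares no token with the unwanted set.
    return [b for b in bigrams if unwanted.isdisjoint(b.split())]


master_list = ['this is', 'a sample', 'sample word', 'then how', 'on top']
unwanted_words = ['this', 'is', 'a', 'on', 'in', 'an', 'the', 'them']

kept = drop_bigrams(master_list, unwanted_words)
# kept == ['sample word', 'then how']
```

Splitting into tokens matters here: a plain substring test would also discard 'sample word' and 'then how', because 'a' and 'the' occur inside them as substrings.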
tobias_k