
From a list of bigrams, I need to remove every bigram that does not contain at least one term exactly matching a term in a list of unigrams.

The Two Lists

bigram_list = ['computer vision', 'data excellence', 'data visualization']

unigram_list = ['excel', 'tableau', 'visio', 'visualization']

The Objective

cleaned_bigrams = ['data visualization']

What I've Tried

I tried adapting the approach from "Removing separate list of items from another list in Python 3.x", but failed.

I also tried the approach from "Get rid of unigrams in a list if contained within bigrams or trigrams python", but couldn't get it to work.

I tried to adapt the answer to a previous question I asked, "Create new boolean fields based on specific bigrams appearing in a tokenized pandas dataframe", but couldn't get that going either.

Thanks in advance for any help you can provide, and would appreciate an upvote if you think this is a good question!

Cary Cox

1 Answer


Here is one way to do it:

bigram_list = ["computer vision", "data excellence", "data visualization"]
unigram_list = ["excel", "tableau", "visio", "visualization"]

# Initialize a dict that counts matches for each bigram
counts = {key: 0 for key in bigram_list}

# Count how many unigrams exactly match a term in each bigram
for big in bigram_list:
    terms = big.split(" ")
    for uni in unigram_list:
        if uni in terms:
            counts[big] += 1

# Keep only the bigrams with at least one match
cleaned_bigrams = [item for item in bigram_list if counts[item] > 0]
print(cleaned_bigrams)
# Output: ['data visualization']
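
If the match counts themselves are not needed, the same filter can probably be written more compactly. The following is only a sketch, not a drop-in replacement: it assumes the terms in each bigram are separated by whitespace, and the name unigram_set is introduced here for illustration. Converting the unigrams to a set makes each membership test an exact, whole-term comparison, so 'excel' does not match 'excellence' and 'visio' does not match 'vision'.

```python
bigram_list = ["computer vision", "data excellence", "data visualization"]
unigram_list = ["excel", "tableau", "visio", "visualization"]

# Assumption: unigram_set is a helper name chosen for this sketch.
# A set gives fast, exact (whole-term) membership checks.
unigram_set = set(unigram_list)

# Keep a bigram if any of its whitespace-separated terms is in the set.
cleaned_bigrams = [
    bigram
    for bigram in bigram_list
    if any(term in unigram_set for term in bigram.split())
]
print(cleaned_bigrams)
# Output: ['data visualization']
```

Because any() stops at the first matching term, this version avoids looping over every unigram for every bigram, which may help on larger lists.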
Laurent
    Very artful how you did this, and it runs quickly for my dataset. Using the dictionary to count matches is clever. Much appreciated! – Cary Cox Jun 06 '22 at 14:35