
I would like to count the distinct words in an article, but I am having trouble grouping words that share a meaning and are derived from one another.

For instance, I would like "gasoline" and "gas" to be treated as the same token in sentences like "The price of gasoline has risen." and "'Gas' is a colloquial form of the word gasoline in North American English. Conversely, in BE (British English) the term would be 'petrol'." Therefore, if these two sentences comprised the entire article, the count for "gas" (or "gasoline") would be 3 ("petrol" would not be counted).
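To make the desired behaviour concrete, here is a minimal sketch of the output I am after, with the grouping done by hand (the SYNONYMS mapping is purely illustrative; producing it automatically is exactly my problem):

from collections import Counter

# hand-built grouping, purely for illustration -- this is the part I want to automate
SYNONYMS = {'gas': 'gasoline'}

tokens = ['gasoline', 'gas', 'gasoline', 'petrol']  # content words from the two sentences
counts = Counter(SYNONYMS.get(t, t) for t in tokens)
print(counts['gasoline'])  # 3 -- "gas" and "gasoline" are counted together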

I have tried using NLTK's stemmers and lemmatizers, but to no avail. Most seem to reproduce "gas" as gas and "gasoline" as gasolin, which is not helpful for my purposes at all. I understand that this is the usual behaviour. I have checked out a thread that seems somewhat similar; however, the answers there are not entirely applicable to my case, as I require the words to be derived from one another.
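For reference, this is roughly what I observed with the Porter stemmer: the stems come out different, so the two words are never grouped:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('gas'))       # gas
print(stemmer.stem('gasoline'))  # gasolin -- not equal to 'gas', so no grouping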

How can I treat derived words with the same meaning as the same token, so that I can count them together?

1 Answer

I propose a two-step approach:

First, find synonyms by comparing word embeddings (skipping stopwords). This should filter out similarly written words that mean something else, such as gasoline and gaseous.

Then, check whether the synonyms share part of their stem: essentially, whether "gas" is contained in "gasolin" or vice versa. This should suffice, because you only compare words that were already identified as synonyms.

import spacy
import itertools
from nltk.stem.porter import PorterStemmer

threshold = 0.6

# compare the stems of two candidate synonyms: accept the pair
# if either stem is contained in the other
stemmer = PorterStemmer()
def compare_stems(a, b):
    if stemmer.stem(a) in stemmer.stem(b):
        return True
    if stemmer.stem(b) in stemmer.stem(a):
        return True
    return False

candidate_synonyms = {}

# add a candidate pair to the dictionary of synonym sets
def add_to_synonym_dict(a, b):
    if a not in candidate_synonyms:
        if b not in candidate_synonyms:
            # neither word is known yet: start a new set keyed by a
            candidate_synonyms[a] = {a, b}
            return
        # b is already a key, so add a to b's set instead
        a, b = b, a
    candidate_synonyms[a].add(b)

nlp = spacy.load('en_core_web_lg') 

text = u'The price of gasoline has risen. "Gas" is a colloquial form of the word gasoline in North American English. Conversely in BE the term would be petrol. A gaseous state has nothing to do with oil.'

words = nlp(text)

# compare every word with every other word to find similar pairs
for a, b in itertools.combinations(words, 2):
    # skip pairs where either word is a stopword or punctuation
    if a.is_stop or b.is_stop or a.is_punct or b.is_punct:
        continue
    if a.similarity(b) > threshold:
        if compare_stems(a.text.lower(), b.text.lower()):
            add_to_synonym_dict(a.text.lower(), b.text.lower())

print(candidate_synonyms)
# output: {'gasoline': {'gas', 'gasoline'}}

Then you can count your synonym candidates based on their occurrences in the text.
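For example, here is a minimal sketch of that counting step, reusing candidate_synonyms and words from above and assuming each set's dictionary key serves as the canonical form of its group:

from collections import Counter

# map every member of a synonym set to its canonical form (the dictionary key)
canonical = {}
for key, group in candidate_synonyms.items():
    for member in group:
        canonical[member] = key

counts = Counter()
for token in words:
    if token.is_stop or token.is_punct:
        continue
    lowered = token.text.lower()
    counts[canonical.get(lowered, lowered)] += 1

print(counts['gasoline'])  # 3 -- 'gas' and 'gasoline' are counted together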

Note: I chose the synonym threshold of 0.6 somewhat arbitrarily. You would probably want to test which threshold suits your task. Also, my code is just a quick and dirty example; it could be done a lot more cleanly.
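A quick way to calibrate it is to rerun the pairing step for a few candidate thresholds and inspect the resulting groups (a rough sketch, reusing the objects defined above):

# try several thresholds and print the groups each one produces
for t in (0.4, 0.5, 0.6, 0.7, 0.8):
    groups = {}
    for a, b in itertools.combinations(words, 2):
        if a.is_stop or b.is_stop or a.is_punct or b.is_punct:
            continue
        if a.similarity(b) > t and compare_stems(a.text.lower(), b.text.lower()):
            groups.setdefault(a.text.lower(), {a.text.lower()}).add(b.text.lower())
    print(t, groups)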
