I would like to count unrelated words in an article, but I am having trouble grouping words that share a meaning and are derived from one another.
For instance, I would like "gasoline" and "gas" to be treated as the same token in sentences like
The price of gasoline has risen.
and
"Gas" is a colloquial form of the word gasoline in North American English. Conversely, in BE the term would be "petrol".
Therefore, if these two sentences comprised the entire article, the count for "gas" (or "gasoline") would be 3 ("petrol" would not be counted).
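To make the desired result concrete, here is a minimal sketch that produces the count I am after using a hand-written mapping. The names `VARIANTS` and `count_tokens` are hypothetical and made up for illustration; building such a mapping automatically for derived words is exactly the part I do not know how to do.

```python
import re
from collections import Counter

# Hypothetical, hand-written mapping from a derived variant to the token
# it should be counted under. In practice this is what I want to derive
# automatically instead of listing by hand.
VARIANTS = {"gasoline": "gas"}


def count_tokens(text, variants=VARIANTS):
    """Count lowercase word tokens, merging known variants into one token."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(variants.get(word, word) for word in words)


article = (
    "The price of gasoline has risen. "
    '"Gas" is a colloquial form of the word gasoline in North American English. '
    'Conversely, in BE the term would be "petrol".'
)

counts = count_tokens(article)
print(counts["gas"])     # 3
print(counts["petrol"])  # 1, kept separate from "gas"
```

Here "petrol" still gets its own entry but does not contribute to the count for "gas", which is the behaviour I want.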
I have tried using NLTK's stemmers and lemmatizers, but to no avail. Most seem to reproduce "gas" as "gas" and "gasoline" as "gasolin", which is not helpful for my purposes at all. I understand that this is the usual behaviour. I have checked out a thread that seems somewhat similar; however, the answers there are not completely applicable to my case, as I require the words to be derived from one another.
How can I treat derived words of the same meaning as the same token in order to count them together?