Hello Stackoverflow Community,

I am looking for ideas on how to handle bigrams made of the same words in a different order when doing topic modeling in Python.

I have a topic model in which two bigrams that mean the same thing are treated as different features because their words appear in a different order. I need a way to treat those two bigrams as synonyms.

Ideas and suggestions are welcome.

Example: I want ‘lease extension’ and ‘extension lease’ to be treated as the same word in the word matrix.

Any type of suggestions and ideas are most welcome.

Thank you in advance, Nikhar

1 Answer

Before you treat these bigrams as interchangeable, make sure that they actually are. If they are not, merging them will reduce the quality of your analysis. 'foot_doctor' and 'doctor_foot' may not refer to the same thing, especially if you applied other preprocessing steps such as stemming or lemmatizing, which can turn 'the doctor's foot' into 'doctor foot'.

Assuming the bigrams really are interchangeable, you can simply rewrite one as the other. Python offers many built-in string methods; in your example, str.replace() can replace one bigram with the other:

oldfakedoc = 'my landlord gave me a lease extension'
newfakedoc = oldfakedoc.replace('lease extension', 'extension lease')
print(newfakedoc)

This prints "my landlord gave me a extension lease". Loop over all the bigrams you want to replace, and then run your model.
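The loop over bigrams could look something like the sketch below. The mapping CANONICAL, the function name normalize_bigrams, and the sample documents are all made up for illustration; the regular expression with word boundaries is my own addition so that a bigram is not replaced inside a longer word, and it assumes your text is already lowercased and whitespace-separated.

```python
import re

# Hypothetical mapping: variant bigram -> the form you want to keep.
CANONICAL = {
    "extension lease": "lease extension",
}

def normalize_bigrams(text, mapping=CANONICAL):
    """Return a copy of text with every variant bigram rewritten."""
    for variant, canonical in mapping.items():
        # \b keeps us from matching inside longer words.
        pattern = r"\b" + re.escape(variant) + r"\b"
        text = re.sub(pattern, canonical, text)
    return text

# Made-up documents; raw_docs is left untouched.
raw_docs = [
    "my landlord gave me a lease extension",
    "we asked for an extension lease last year",
]
clean_docs = [normalize_bigrams(doc) for doc in raw_docs]
print(clean_docs)
```

Because the result goes into a new list, the raw documents stay available if you need to check what was changed.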

You can also use this approach if you do not want to stem or lemmatize all of your documents but have topics that load very heavily on strongly related words, such as "jump" and "jumping". Also, make sure you do not overwrite your raw data, so that you can go back and reconstruct where these replacements were made if you need to.
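As a rough sketch of merging related word forms while keeping track of what was changed: the MERGE mapping, the merge_tokens helper, and the example sentences below are hypothetical, and the code assumes simple whitespace tokenization.

```python
# Hypothetical mapping of related word forms to one shared token.
MERGE = {"jumping": "jump", "jumped": "jump"}

def merge_tokens(doc, mapping=MERGE):
    """Return the rewritten document and a list of the changes made."""
    merged = []
    changes = []  # (token position, original token, replacement)
    for i, tok in enumerate(doc.split()):
        new = mapping.get(tok, tok)
        if new != tok:
            changes.append((i, tok, new))
        merged.append(new)
    return " ".join(merged), changes

# Made-up documents; the raw list is never modified.
raw_docs = ["she kept jumping over the fence", "they jump every day"]
results = [merge_tokens(doc) for doc in raw_docs]
processed_docs = [text for text, _ in results]
change_log = [changes for _, changes in results]
print(processed_docs)
print(change_log)
```

The change log is one possible way to reconstruct where replacements happened; keeping an untouched copy of the raw data alone may be enough for your purposes.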

jhl