Before you treat these bigrams as interchangeable, make sure that they actually are; if they are not, merging them will reduce the quality of your analysis. 'foot_doctor' and 'doctor_foot' may not refer to the same thing, especially if you took other preprocessing steps such as stemming or lemmatizing, for example turning 'the doctor's foot' into 'doctor foot'.
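One way to check this is to read each bigram in context before merging. The following is only a minimal sketch: the helper name `contexts`, the `width` parameter, and the sample documents are made up for illustration and are not from your data.

```python
def contexts(docs, phrase, width=30):
    """Return every occurrence of phrase with up to `width` characters
    of surrounding text on each side."""
    found = []
    for doc in docs:
        start = doc.find(phrase)
        while start != -1:
            end = start + len(phrase)
            found.append(doc[max(0, start - width):end + width])
            # Continue searching after the current match.
            start = doc.find(phrase, end)
    return found

# Hypothetical documents, for illustration only.
sample_docs = [
    'she saw a foot_doctor about the pain',
    'the doctor_foot injury kept him home',
]

for bigram in ('foot_doctor', 'doctor_foot'):
    print(bigram, contexts(sample_docs, bigram))
```

If the two bigrams show up in clearly different contexts, as in this made-up sample, they are probably not safe to merge.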
Assuming the bigrams really do mean the same thing, you can simply rewrite one as the other. Python's built-in string methods are enough for this: in your example, replace() swaps one bigram for another.
oldfakedoc = 'my landlord gave me a lease extension'
newfakedoc = oldfakedoc.replace('lease extension', 'extension lease')
print(newfakedoc)
This prints 'my landlord gave me a extension lease'. Loop over every bigram you want to replace, and then run your model.
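The loop could look roughly like this. It is a sketch under assumptions: the `replacements` mapping, the `normalize_bigrams` helper, and the second document are hypothetical, and you would fill the mapping with your own bigram pairs.

```python
# Hypothetical mapping: variant bigram -> the form you want to keep.
replacements = {
    'lease extension': 'extension lease',
    'doctor foot': 'foot doctor',
}

def normalize_bigrams(doc, mapping):
    """Return a copy of doc with every variant replaced by its kept form."""
    for variant, canonical in mapping.items():
        doc = doc.replace(variant, canonical)
    return doc

docs = [
    'my landlord gave me a lease extension',
    'the doctor foot clinic is closed',
]

# Build a new list; the original docs are left untouched.
normalized_docs = [normalize_bigrams(d, replacements) for d in docs]
print(normalized_docs)
```

Note that replace() matches plain substrings, so 'release extension' would also be rewritten. If that matters for your corpus, matching on word boundaries with the re module is one alternative. Also keep the mapping one-directional: if a kept form is itself listed as a variant elsewhere, the result depends on the order of the replacements.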
You can also use this approach if you do not want to stem or lemmatize all of your documents, but have topics that load very heavily on strongly related words, such as "jump" and "jumping". Whatever you replace, do not overwrite your raw data, so that you can go back and reconstruct where the replacements were made if you need to.
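To keep the raw data intact and still know what was changed, you could write the cleaned text to a new list and record each replacement as you go. This is a sketch only: `replace_and_log`, `merge_map`, and the sample documents are hypothetical names and data.

```python
# Hypothetical raw documents and mapping, for illustration only.
raw_docs = [
    'she kept jumping over the fence',
    'my landlord gave me a lease extension',
    'nothing to change here',
]
merge_map = {
    'jumping': 'jump',
    'lease extension': 'extension lease',
}

def replace_and_log(docs, mapping):
    """Return (cleaned_docs, log) without modifying docs.

    Each log entry is (document index, variant, replacement, count).
    """
    cleaned = []
    log = []
    for i, doc in enumerate(docs):
        new_doc = doc
        for variant, canonical in mapping.items():
            n = new_doc.count(variant)  # non-overlapping occurrences
            if n:
                log.append((i, variant, canonical, n))
                new_doc = new_doc.replace(variant, canonical)
        cleaned.append(new_doc)
    return cleaned, log

cleaned_docs, replacement_log = replace_and_log(raw_docs, merge_map)
print(cleaned_docs)
print(replacement_log)
```

Run your model on `cleaned_docs` and keep `raw_docs` and `replacement_log` around, so that any surprising topic can be traced back to the original wording.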