I am preparing to do topic modeling with Mallet and have finished pulling the raw datasets. Before I import the texts and start modeling, I need to clean and streamline them. I have my stopword lists ready, and I know I can easily remove punctuation, digits, etc. with Excel. What I am a little fuzzy about is stemming and lemmatizing: not the concepts themselves, but what the best approach would be.
To give a better overview, here is what I would like to do:
- standardize inconsistencies in spelling, e.g. topicmodeling -> topic modeling
- remove extra whitespace, e.g. two spaces in a row
- stem and lemmatize
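
To make the three steps concrete, here is a rough Python sketch of what I have in mind, using only the standard library. The names `SPELLING_FIXES`, `SUFFIXES`, and `naive_stem` are made up for illustration, and the suffix stripper is only a crude stand-in for a real stemmer (it is not the Porter algorithm). Lemmatizing is not shown, since it needs a dictionary or a trained model.

```python
import re

# Hypothetical mapping of spelling variants to the preferred form.
SPELLING_FIXES = {
    "topicmodeling": "topic modeling",
    "topicmodelling": "topic modeling",
}

# Hypothetical, deliberately crude suffix list; a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def standardize_spelling(text):
    """Replace each known variant (whole words only, case-insensitive)."""
    for variant, preferred in SPELLING_FIXES.items():
        pattern = rf"\b{re.escape(variant)}\b"
        text = re.sub(pattern, preferred, text, flags=re.IGNORECASE)
    return text


def collapse_whitespace(text):
    """Turn any run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, fix spelling, collapse whitespace, then stem each token."""
    text = collapse_whitespace(standardize_spelling(text.lower()))
    return " ".join(naive_stem(word) for word in text.split())


if __name__ == "__main__":
    print(preprocess("Topicmodeling  is   used for  modeling topics"))
```

The output of `preprocess` is one cleaned string per document, which could then be written to one file per document (or one line per document) before importing.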
Based on your experience, can anyone recommend the best approach to these three steps, especially the last one? Is there an app that I can use for that?
Many thanks in advance!