I'm new to Mallet and to topic modeling, and I work in the field of art history. I'm using Mallet 2.0.8 from the command line (I don't know Java yet). Before training the model, I'd like to remove the most common and the least common words (using a threshold of 10 occurrences in the whole corpus, as D. Mimno recommends), because the results aren't clean even with the stoplist, which is not surprising.
I've found that the prune command could be useful, with options like prune-document-freq. Is that right, or is there another way? Could someone explain the whole procedure in detail (for example: how to create or input the Vectors2Vectors file, at which stage, and what comes next)? It would be much appreciated!
Sorry for the basic question; I'm a beginner with Mallet and text mining, but it's quite exciting!
Thanks a lot for your help!