
I'm new to Mallet and to topic modeling, which I'm using in the field of art history. I'm working with Mallet 2.0.8 from the command line (I don't know Java yet). I'd like to remove the most common words and the least common ones (those appearing about 10 times or fewer in the whole corpus, as D. Mimno recommends) before training the model, because the results aren't clean (even with the stoplist), which is not surprising.

I've found that the prune command could be useful, with options like --prune-document-freq. Is that right? Or is there another way? Could someone explain the whole procedure in detail (for example: how to create the Vectors2Vectors input file, at which stage, and what comes next)? It would be much appreciated!

I'm sorry for this question; I'm a beginner with Mallet and text mining. But it's quite exciting!

Thanks a lot for your help!

Eugenie

1 Answer


There are two places where you can use Mallet to curate the vocabulary. The first is at data import, for example with the import-file command. The --remove-stopwords option removes a fixed set of English stopwords. It is there for backwards-compatibility reasons, and is probably not a bad idea for some English-language prose, but you can generally do better by creating a custom list. I would recommend instead the --stoplist-file option along with the name of a file. All words in this file, separated by spaces and/or newlines, will be removed. (Using both options will remove the union of the two lists, which is probably not what you want.) Another useful option is --replacement-files, which allows you to specify multi-word strings to treat as single words. For example, this file:

black hole
white dwarf

will convert "black hole" into "black_hole". Here newlines are treated differently from spaces: each line defines one multi-word term. You can also specify multi-word stopwords with --deletion-files.
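For instance, a sketch of an import command that combines these options might look like the following; the file names corpus.txt, stoplist.txt and replacements.txt are only placeholders, and --keep-sequence is included because train-topics needs word order to be preserved:

bin/mallet import-file --input corpus.txt --output corpus.mallet --keep-sequence --stoplist-file stoplist.txt --replacement-files replacements.txt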

Once you have a Mallet file, you can modify it with the prune command (which runs the Vectors2Vectors tool you mention). --prune-count N removes words that occur N or fewer times in the whole corpus. --prune-document-freq N removes words that occur in N or fewer documents, counting each document only once no matter how often the word appears in it; this version can be more robust against words that occur many times in a single document. You can also prune by proportion: --min-idf removes frequent (low-IDF) words, --max-idf removes infrequent (high-IDF) words. A word with IDF 10.0 occurs less than once in 20,000 documents; a word with IDF below 2.0 occurs in more than 13% of the collection.
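As a rough sketch of the whole pipeline, assuming the import step above produced corpus.mallet (the thresholds and output file names here are illustrative placeholders, not recommendations), you would prune the imported file and then train on the pruned version:

bin/mallet prune --input corpus.mallet --output corpus.pruned.mallet --prune-count 10 --prune-document-freq 5

bin/mallet train-topics --input corpus.pruned.mallet --num-topics 20 --optimize-interval 20 --output-topic-keys keys.txt --output-doc-topics doc-topics.txt

Here --prune-count 10 matches the "about 10 times in the whole corpus" cut-off from the question, and keys.txt will contain the top words for each topic so you can check whether the vocabulary looks cleaner.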

David Mimno