Remove most common words mallet

Question

I create a list of instances, each a token feature sequence, from a list of strings. From the command line I can prune the data based on counts, tf-idf, etc. (https://github.com/mimno/Mallet/blob/master/src/cc/mallet/classify/tui/Vectors2Vectors.java). How can I do the same in Java, and how do I need to extend my code?

My goal is to remove the most common words before LDA topic modeling.

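import java.util.ArrayList;
import java.util.List;

import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.ArrayIterator;
import cc.mallet.types.InstanceList;

public static InstanceList createInstanceList(List<String> texts) {

    ArrayList<Pipe> pipes = new ArrayList<Pipe>();

    // Tokenize, lowercase, drop words on the built-in stoplist,
    // then map the remaining tokens to feature indices.
    pipes.add(new CharSequence2TokenSequence());
    pipes.add(new TokenSequenceLowercase());
    pipes.add(new TokenSequenceRemoveStopwords());
    pipes.add(new TokenSequence2FeatureSequence());

    InstanceList instanceList = new InstanceList(new SerialPipes(pipes));

    // Run every input string through the pipe chain.
    instanceList.addThruPipe(new ArrayIterator(texts));
    return instanceList;
}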

Thank you in advance for your help!

score 2 · Accepted Answer · answered Mar 04 '18 at 17:17

Look at the code you linked to for examples, starting around line 125. FeatureCountTool generates term-frequency and document-frequency information. You can then either generate a pruned alphabet and construct a new instance list, as Vectors2Vectors does, or generate a new stoplist Set and re-import the documents from the source files.
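As a rough illustration of the second option (building a stoplist Set from document frequencies), here is a minimal sketch in plain Java with no Mallet dependency. The class name CommonWordStoplist, the method findCommonWords, the maxDocFraction threshold, and the simple letters-only tokenization are all assumptions made for this example. They are not part of Mallet, and this is not how FeatureCountTool computes its counts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Mallet class.
class CommonWordStoplist {

    // Returns every lowercased token that occurs in more than
    // maxDocFraction of the documents (document frequency, not term frequency).
    static Set<String> findCommonWords(List<String> texts, double maxDocFraction) {
        Map<String, Integer> docFrequency = new HashMap<>();
        for (String text : texts) {
            // Collect the distinct tokens so each word counts once per document.
            Set<String> seen = new HashSet<>();
            // Assumed tokenization: split on anything that is not a letter.
            for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    seen.add(token);
                }
            }
            for (String token : seen) {
                docFrequency.merge(token, 1, Integer::sum);
            }
        }

        // Keep only the words whose document count exceeds the threshold.
        Set<String> common = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFrequency.entrySet()) {
            if (entry.getValue() > maxDocFraction * texts.size()) {
                common.add(entry.getKey());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
                "the cat sat", "the dog ran", "the bird flew", "a cat ran");
        // "the" appears in 3 of 4 documents, above the 0.5 threshold.
        System.out.println(findCommonWords(texts, 0.5)); // prints [the]
    }
}
```

The resulting set could then be handed to the stopword pipe before re-importing the texts, for example via addStopWords(String[]) on TokenSequenceRemoveStopwords. Check the Mallet version you use to confirm that method is available.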

David Mimno

Thanks! I thought there might be a class available to prune, but then this is the way to go. – Joker3139 Mar 04 '18 at 19:01
