I create from a list of strings a list of instances consisting of token feature sequences. Via command line, I can prune those data based on counts, tf-idf etc. (https://github.com/mimno/Mallet/blob/master/src/cc/mallet/classify/tui/Vectors2Vectors.java). But what if I want to do it in Java? How do I have to extend my code?
My target is to remove most common words for LDA topic modeling.
public static InstanceList createInstanceList(List<String> texts) {
ArrayList<Pipe> pipes = new ArrayList<Pipe>();
pipes.add(new CharSequence2TokenSequence());
pipes.add(new TokenSequenceLowercase());
pipes.add(new TokenSequenceRemoveStopwords());
pipes.add(new TokenSequence2FeatureSequence());
InstanceList instanceList = new InstanceList(new SerialPipes(pipes));
instanceList.addThruPipe(new ArrayIterator(texts));
return instanceList;
}
Thank you in advance for your help!