
From a list of strings, I create a list of instances consisting of token feature sequences. From the command line, I can prune that data based on counts, tf-idf, etc. (https://github.com/mimno/Mallet/blob/master/src/cc/mallet/classify/tui/Vectors2Vectors.java). But what if I want to do the same in Java? How do I have to extend my code?

My goal is to remove the most common words before LDA topic modeling.

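import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.ArrayIterator;
import cc.mallet.types.InstanceList;
import java.util.ArrayList;
import java.util.List;

public static InstanceList createInstanceList(List<String> texts) {

    ArrayList<Pipe> pipes = new ArrayList<Pipe>();

    // Tokenize, lowercase, drop the default stopwords, then map tokens to feature indices.
    pipes.add(new CharSequence2TokenSequence());
    pipes.add(new TokenSequenceLowercase());
    pipes.add(new TokenSequenceRemoveStopwords());
    pipes.add(new TokenSequence2FeatureSequence());

    InstanceList instanceList = new InstanceList(new SerialPipes(pipes));

    // Run every input string through the pipeline.
    instanceList.addThruPipe(new ArrayIterator(texts));
    return instanceList;
}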

Thank you in advance for your help!

Joker3139

1 Answer


Look at the code that you linked to for examples, starting around line 125. The FeatureCountTool generates term frequency and document frequency information. From there you have two options: generate a pruned alphabet and construct a new instance list, as in Vectors2Vectors, or generate a new stoplist Set and reimport the documents from the source files.
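
As a rough illustration of the second option (building a stoplist Set), here is a minimal sketch in plain Java that does not call Mallet at all. It counts document frequencies by hand and returns every word that occurs in more than a given fraction of the documents. The class name CommonWordStoplist, the method mostCommonWords, and the maxDocFraction parameter are hypothetical names made up for this example, and the letter-only tokenization is only an approximation of what the pipes above do.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Mallet: builds a corpus-specific stoplist
// from document frequencies.
class CommonWordStoplist {

    // Returns the words that occur in more than maxDocFraction of the texts.
    static Set<String> mostCommonWords(List<String> texts, double maxDocFraction) {
        Map<String, Integer> docFreq = new HashMap<>();

        for (String text : texts) {
            // Collect each distinct word once per document.
            // Assumption: lowercased runs of letters approximate the pipe's tokens.
            Set<String> seen = new HashSet<>();
            for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    seen.add(token);
                }
            }
            for (String token : seen) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }

        // Keep only the words whose document frequency exceeds the threshold.
        double threshold = maxDocFraction * texts.size();
        Set<String> stoplist = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if (entry.getValue() > threshold) {
                stoplist.add(entry.getKey());
            }
        }
        return stoplist;
    }
}
```

The resulting set could then be handed to the stopword pipe before the documents are imported again. As far as I recall, TokenSequenceRemoveStopwords has an addStopWords method that accepts a String array, but check the Javadoc for your Mallet version before relying on that.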

David Mimno