
I'm doing topic modeling with the MALLET library. My data set is at the path filePath, and the CsvIterator seems to read the data correctly, because model.getData() has about 27,000 rows, which matches my data set. I wrote a loop that prints the instances and topic sequences of the first 10 documents, but the token size is 0. Where did I go wrong?

In the code below, I want to show the top 5 words of each topic, with their proportions, for the first 10 documents, but all the outputs are the same.

Example output in the console:

    ---- document 0
    0 0.200 com (1723) twitter (1225) http (871) cbr (688) canberra (626)
    1 0.200 com (981) twitter (901) day (205) may (159) wed (156)
    2 0.200 twitter (1068) com (947) act (433) actvcc (317) canberra (302)
    3 0.200 http (1039) canberra (841) jobs (378) dlvr (313) com (228)
    4 0.200 com (1185) www (1074) http (831) news (708) canberratimes (560)

    ---- document 1
    0 0.200 com (1723) twitter (1225) http (871) cbr (688) canberra (626)
    1 0.200 com (981) twitter (901) day (205) may (159) wed (156)
    2 0.200 twitter (1068) com (947) act (433) actvcc (317) canberra (302)
    3 0.200 http (1039) canberra (841) jobs (378) dlvr (313) com (228)
    4 0.200 com (1185) www (1074) http (831) news (708) canberratimes (560)

As far as I know, an LDA model generates each document by assigning its words to topics. So why are the results for every document the same?

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.Formatter;
    import java.util.Iterator;
    import java.util.Locale;
    import java.util.TreeSet;
    import java.util.regex.Pattern;

    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.topics.TopicAssignment;
    import cc.mallet.types.*;

    // Import pipeline: lowercase, tokenize, remove stopwords (stoplists/en.txt),
    // then map tokens to feature ids.
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    pipeList.add(new CharSequenceLowercase());
    pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
    pipeList.add(new TokenSequenceRemoveStopwords(new File(pathStopWords), "UTF-8", false, false, false));
    pipeList.add(new TokenSequence2FeatureSequence());

    InstanceList instances = new InstanceList(new SerialPipes(pipeList));

    Reader fileReader = new InputStreamReader(new FileInputStream(new File(filePath)), "UTF-8");
    // Header of my data set:
    // row,location,username,hashtaghs,text,retweets,date,favorites,numberOfComment
    CsvIterator csvIterator = new CsvIterator(fileReader,
            Pattern.compile("^(\\d+)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*([^,]*)[,]*[^,]*[,]*[^,]*[,]*[^,]*[,]*[^,]*$"),
            2, 0, 1); // data, label, name fields
    instances.addThruPipe(csvIterator);

    int numTopics = 5;
    ParallelTopicModel model = new ParallelTopicModel(numTopics, 1.0, 0.01);
    model.addInstances(instances);
    model.setNumThreads(2);
    model.setNumIterations(50);
    model.estimate();

    Alphabet dataAlphabet = instances.getDataAlphabet();
    ArrayList<TopicAssignment> arrayTopics = model.getData();

    for (int i = 0; i < 10; i++) {
        System.out.println("---- document " + i);

        // Print each token of the document together with its assigned topic.
        FeatureSequence tokens = (FeatureSequence) arrayTopics.get(i).instance.getData();
        LabelSequence topics = arrayTopics.get(i).topicSequence;

        Formatter out = new Formatter(new StringBuilder(), Locale.US);
        for (int position = 0; position < tokens.getLength(); position++) {
            out.format("%s-%d ", dataAlphabet.lookupObject(tokens.getIndexAtPosition(position)),
                    topics.getIndexAtPosition(position));
        }
        System.out.println(out);

        // This document's topic proportions.
        double[] topicDistribution = model.getTopicProbabilities(i);

        // Word counts per topic, sorted in descending order.
        ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();

        // Top 5 words of each topic, prefixed with this document's proportion.
        for (int topic = 0; topic < numTopics; topic++) {
            Iterator<IDSorter> iterator = topicSortedWords.get(topic).iterator();
            out = new Formatter(new StringBuilder(), Locale.US);
            out.format("%d\t%.3f\t", topic, topicDistribution[topic]);
            int rank = 0;
            while (iterator.hasNext() && rank < 5) {
                IDSorter idCountPair = iterator.next();
                out.format("%s (%.0f) ", dataAlphabet.lookupObject(idCountPair.getID()), idCountPair.getWeight());
                rank++;
            }
            System.out.println(out);
        }

        // Collect the top 5 words of topic 0 (currently unused).
        StringBuilder topicZeroText = new StringBuilder();
        Iterator<IDSorter> iterator = topicSortedWords.get(0).iterator();
        int rank = 0;
        while (iterator.hasNext() && rank < 5) {
            IDSorter idCountPair = iterator.next();
            topicZeroText.append(dataAlphabet.lookupObject(idCountPair.getID())).append(" ");
            rank++;
        }
    }
NASRIN

1 Answer

The topics are defined at the level of the model, not at the level of the documents, so the top words of each topic should be the same for every document. What differs between documents is the topic proportions, and since your first 10 documents apparently have zero tokens (the per-token printout is empty), each one just reports the prior: with 5 topics and an alphaSum of 1.0, that is exactly 0.200 per topic.
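
To make the distinction concrete, here is a minimal sketch (reusing the model, dataAlphabet, and numTopics variables from the question) that prints the model-level topics once, outside the document loop, and only the per-document proportions inside it:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.TreeSet;

    import cc.mallet.types.IDSorter;

    // Model-level: the top words per topic are computed from the whole
    // corpus, so they are identical no matter which document is printed.
    ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords();
    for (int topic = 0; topic < numTopics; topic++) {
        StringBuilder topWords = new StringBuilder();
        Iterator<IDSorter> it = topicSortedWords.get(topic).iterator();
        for (int rank = 0; rank < 5 && it.hasNext(); rank++) {
            topWords.append(dataAlphabet.lookupObject(it.next().getID())).append(' ');
        }
        System.out.println("topic " + topic + ": " + topWords);
    }

    // Document-level: only the topic proportions vary between documents.
    for (int doc = 0; doc < 10; doc++) {
        System.out.println("---- document " + doc + " "
                + Arrays.toString(model.getTopicProbabilities(doc)));
    }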

It looks like all your text is URLs. Adding a PrintInputPipe to your import sequence might help with debugging.
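
As a sketch of that debugging step: MALLET's cc.mallet.pipe package includes PrintInputAndTarget, a pass-through pipe that prints each instance's current data as it flows by (I am assuming this is the class meant by PrintInputPipe, and the pipe positions chosen below are just one option; pathStopWords is the same variable as in the question):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    import cc.mallet.pipe.*;

    // Same import sequence as in the question, with print pipes inserted.
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    pipeList.add(new PrintInputAndTarget());   // raw text captured by the CsvIterator regex
    pipeList.add(new CharSequenceLowercase());
    pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
    pipeList.add(new TokenSequenceRemoveStopwords(new File(pathStopWords), "UTF-8", false, false, false));
    pipeList.add(new TokenSequence2FeatureSequence());
    pipeList.add(new PrintInputAndTarget());   // tokens that actually reach the model

If the first print shows empty strings, the CsvIterator regex is not capturing the text column; if it shows full URLs, the tokenizer pattern is what lets com, twitter, and http dominate the topics.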

David Mimno