CoreNLP Sentiment training data in wrong format

Question

I'm trying to train my own sentiment analysis model for CoreNLP. I want to do this in Java code (not from the command line), so I copied pieces from https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/sentiment/BuildBinarizedDataset.java to prepare the data, and some pieces from https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/sentiment/SentimentTraining.java to do the actual training. To understand what's going on, I condensed lines 171-226 of the former into the following:

    String text = IOUtils.slurpFileNoExceptions(inputPath);
    String[] chunks = text.split("\\n\\s*\\n+"); // need blank line to
    for (String chunk : chunks) {
        if (chunk.trim().isEmpty()) {
            continue;
        }
        String[] lines = chunk.trim().split("\\n");
        String sentence = lines[0];
        StringReader sin = new StringReader(sentence);
        DocumentPreprocessor document = new DocumentPreprocessor(sin);
        document.setSentenceFinalPuncWords(new String[] { "\n" });
        List<HasWord> tokens = document.iterator().next();
        Integer mainLabel = new Integer(tokens.get(0).word());
        tokens = tokens.subList(1, tokens.size());
        Map<Pair<Integer, Integer>, String> spanToLabels = Generics.newHashMap();
        for (int i = 1; i < lines.length; ++i) {
            extractLabels(spanToLabels, tokens, lines[i]);
        }
        Tree tree = parser.apply(tokens);
        Tree binarized = binarizer.transformTree(tree);
        Tree collapsedUnary = transformer.transformTree(binarized);
        if (sentimentModel != null) {
            Trees.convertToCoreLabels(collapsedUnary);
            SentimentCostAndGradient scorer = new SentimentCostAndGradient(sentimentModel, null);
            scorer.forwardPropagateTree(collapsedUnary);
            setPredictedLabels(collapsedUnary);
        } else {
            setUnknownLabels(collapsedUnary, mainLabel);
        }
        Trees.convertToCoreLabels(collapsedUnary);
        collapsedUnary.indexSpans();
        for (Map.Entry<Pair<Integer, Integer>, String> pairStringEntry : spanToLabels.entrySet()) {
            setSpanLabel(collapsedUnary, pairStringEntry.getKey(), pairStringEntry.getValue());
        }

        //trainingTrees.add(collapsedUnary);
        System.out.println("Debugging collaped Unary:" + collapsedUnary);
    }

The println gives me something like:

> Debugging collaped Unary:(ROOT (NP (DT The) (NNS performances)) (@S (VP (VBP are) (ADJP (RB uniformly) (JJ good))) (. .)))

Whereas, from what I understand, it is supposed to look like this (a different sentence, copied here only to show the format):

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2

This is the format described in https://mailman.stanford.edu/pipermail/java-nlp-user/2013-November/004308.html and in the related questions "stanford corenlp sentiment training set" and "How to train the Stanford NLP Sentiment Analysis tool", among others.

Nothing else happens after these lines in BuildBinarizedDataset, so I don't see where the conversion would take place. Can someone tell me how to get the tree into the right format? (Hacking something together myself feels quite stupid here, and there must be something I'm missing.)

The error I get later on, in SentimentTraining, is:

Exception in thread "main" java.lang.NumberFormatException: For input string: "DT"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.valueOf(Integer.java:766)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:37)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:33)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:33)
at edu.stanford.nlp.sentiment.SentimentUtils.readTreesWithLabels(SentimentUtils.java:69)
at edu.stanford.nlp.sentiment.SentimentUtils.readTreesWithGoldLabels(SentimentUtils.java:50)
at de.dkt.eservices.esentimentanalysis.modules.CoreNLPSentimentAnalyzer.trainModel(CoreNLPSentimentAnalyzer.java:251)
at de.dkt.eservices.esentimentanalysis.modules.CoreNLPSentimentAnalyzer.main(CoreNLPSentimentAnalyzer.java:306)

This makes sense: the reader expects a number as the node label, but gets the syntactic label of the node in the tree.
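To make the mismatch concrete, here is a minimal sketch in plain Java (no CoreNLP dependency) that scans a bracketed tree string and reports the first node label that is not an integer from 0 to 4. The class and method names are hypothetical, and the short "sentiment style" tree is a made-up example, not taken from the training set. It reads labels left to right, so the label it reports need not be the one named in the exception above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP: checks bracketed tree strings only.
final class SentimentFormatCheck {
    // A node label is the token that directly follows an opening parenthesis.
    private static final Pattern NODE_LABEL = Pattern.compile("\\(\\s*([^\\s()]+)");

    /** Returns the first node label that is not an integer from 0 to 4, if any. */
    static Optional<String> firstBadLabel(String bracketedTree) {
        Matcher m = NODE_LABEL.matcher(bracketedTree);
        while (m.find()) {
            String label = m.group(1);
            if (!label.matches("[0-4]")) {
                return Optional.of(label);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The tree printed by the debugging statement in the question.
        String parserOutput = "(ROOT (NP (DT The) (NNS performances)) "
                + "(@S (VP (VBP are) (ADJP (RB uniformly) (JJ good))) (. .)))";
        // Made-up tree whose node labels are all sentiment classes.
        String sentimentStyle = "(3 (2 The) (2 Rock))";

        System.out.println(firstBadLabel(parserOutput));   // prints Optional[ROOT]
        System.out.println(firstBadLabel(sentimentStyle)); // prints Optional.empty
    }
}
```

Running a check like this on each tree before writing it out would flag the problem at data preparation time rather than inside the training code.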

Would be grateful for any pointers here!

score 0 · Answer 1 · answered Jul 11 '17 at 09:43

I haven't found a real solution, but in case someone else runs into this problem, the following workaround did the trick:

public static Tree traverseTreeAndChangePosTagsToNumbers(Tree tree) {
    // Note: only the descendants are relabeled; the label of `tree` itself is left as is.
    for (Tree subtree : tree.getChildrenAsList()) {
        String label = subtree.label().value();
        // Anything that is not a sentiment class 0-4 (POS tags, phrase labels
        // such as NP or @S, out-of-range numbers) becomes 2 (neutral).
        if (!label.matches("[0-4]")) {
            subtree.label().setValue("2");
        }
        // Stop at preterminals so the words themselves are left untouched.
        if (!subtree.isPreTerminal()) {
            traverseTreeAndChangePosTagsToNumbers(subtree);
        }
    }
    return tree;
}

This is not really a decent solution: it ignores the option to give sentiment a scope by annotating subphrases in the tree, because the label for every subphrase is always 2 (neutral). Sentiment is therefore always based on the value for the whole sentence/tree. It does at least get rid of the NumberFormatException.
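For illustration, the same idea can be sketched at the string level in plain Java, without CoreNLP. This is a swapped-in variant, not the method above: it rewrites the labels in a bracketed tree string, and it assumes the sentence-level label belongs on the root while every other node gets 2. The class name, method name, and the sentence label 3 are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP: rewrites labels in a bracketed tree string.
final class NeutralRelabel {
    // A node label is the token that directly follows an opening parenthesis.
    private static final Pattern NODE_LABEL = Pattern.compile("\\(\\s*([^\\s()]+)");

    /** Gives the root the sentence label and every other node the neutral label 2. */
    static String relabel(String bracketedTree, int sentenceLabel) {
        if (sentenceLabel < 0 || sentenceLabel > 4) {
            throw new IllegalArgumentException("sentiment label must be in 0..4");
        }
        Matcher m = NODE_LABEL.matcher(bracketedTree);
        StringBuffer out = new StringBuffer();
        boolean isRoot = true;
        while (m.find()) {
            // The first label encountered belongs to the root node.
            String label = isRoot ? Integer.toString(sentenceLabel) : "2";
            m.appendReplacement(out, "(" + label);
            isRoot = false;
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String parserOutput = "(ROOT (NP (DT The) (NNS performances)) "
                + "(@S (VP (VBP are) (ADJP (RB uniformly) (JJ good))) (. .)))";
        // 3 is a made-up sentence-level label for this example.
        System.out.println(relabel(parserOutput, 3));
    }
}
```

For the tree from the question this prints `(3 (2 (2 The) (2 performances)) (2 (2 (2 are) (2 (2 uniformly) (2 good))) (2 .)))`. The same limitation applies: every subphrase stays neutral, so only the sentence-level label carries information.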
