I'm trying to train my own sentiment analysis model for CoreNLP. I want to do this in Java code (not from the command line), so I copied pieces from https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/sentiment/BuildBinarizedDataset.java to prepare the data, and some pieces from https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/sentiment/SentimentTraining.java to do the actual training. To understand what's going on, I condensed lines 171-226 of the former into the following in my own code:
String text = IOUtils.slurpFileNoExceptions(inputPath);
String[] chunks = text.split("\\n\\s*\\n+"); // need blank line to
for (String chunk : chunks) {
    if (chunk.trim().isEmpty()) {
        continue;
    }
    String[] lines = chunk.trim().split("\\n");
    String sentence = lines[0];
    StringReader sin = new StringReader(sentence);
    DocumentPreprocessor document = new DocumentPreprocessor(sin);
    document.setSentenceFinalPuncWords(new String[] { "\n" });
    List<HasWord> tokens = document.iterator().next();
    Integer mainLabel = new Integer(tokens.get(0).word());
    tokens = tokens.subList(1, tokens.size());
    Map<Pair<Integer, Integer>, String> spanToLabels = Generics.newHashMap();
    for (int i = 1; i < lines.length; ++i) {
        extractLabels(spanToLabels, tokens, lines[i]);
    }
    Tree tree = parser.apply(tokens);
    Tree binarized = binarizer.transformTree(tree);
    Tree collapsedUnary = transformer.transformTree(binarized);
    if (sentimentModel != null) {
        Trees.convertToCoreLabels(collapsedUnary);
        SentimentCostAndGradient scorer = new SentimentCostAndGradient(sentimentModel, null);
        scorer.forwardPropagateTree(collapsedUnary);
        setPredictedLabels(collapsedUnary);
    } else {
        setUnknownLabels(collapsedUnary, mainLabel);
    }
    Trees.convertToCoreLabels(collapsedUnary);
    collapsedUnary.indexSpans();
    for (Map.Entry<Pair<Integer, Integer>, String> pairStringEntry : spanToLabels.entrySet()) {
        setSpanLabel(collapsedUnary, pairStringEntry.getKey(), pairStringEntry.getValue());
    }
    //trainingTrees.add(collapsedUnary);
    System.out.println("Debugging collaped Unary:" + collapsedUnary);
}
The println gives me something like:
> Debugging collaped Unary:(ROOT (NP (DT The) (NNS performances)) (@S (VP (VBP are) (ADJP (RB uniformly) (JJ good))) (. .)))
Whereas, from what I understand, it is supposed to look like this (sorry for using a different sentence to show the format):
(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2
This is the format explained in https://mailman.stanford.edu/pipermail/java-nlp-user/2013-November/004308.html and in the questions "stanford corenlp sentiment training set" and "How to train the Stanford NLP Sentiment Analysis tool", among others.
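To make the target shape concrete, here is a minimal string-level sketch of what I expect to end up with. It is not CoreNLP code: the class `RelabelSketch`, the method `relabel`, and the default label `3` are hypothetical, and it only assumes that every non-leaf node should carry a numeric label in place of its syntactic category. Real training data would have different labels on different spans; this only shows the bracket structure.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in CoreNLP.
class RelabelSketch {
    // A node label is the token that directly follows an opening bracket.
    private static final Pattern NODE_LABEL = Pattern.compile("\\(([^\\s()]+)");

    /** Replaces every node label in a bracketed tree string with defaultLabel. */
    public static String relabel(String tree, int defaultLabel) {
        return NODE_LABEL.matcher(tree).replaceAll("(" + defaultLabel);
    }

    public static void main(String[] args) {
        // The tree printed by my debugging line above.
        String parsed = "(ROOT (NP (DT The) (NNS performances)) "
                + "(@S (VP (VBP are) (ADJP (RB uniformly) (JJ good))) (. .)))";
        // Assumed main label 3, chosen arbitrarily for the example.
        System.out.println(relabel(parsed, 3));
        // prints: (3 (3 (3 The) (3 performances)) (3 (3 (3 are) (3 (3 uniformly) (3 good))) (3 .)))
    }
}
```

The words and the bracketing stay untouched; only the category labels (ROOT, NP, DT, ...) are swapped for numbers.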
Nothing further happens after these lines in BuildBinarizedDataset. Can someone tell me how to get the tree into the right format? (Hacking something together myself feels wrong here; there must be something I'm missing.)
The error I get later on, in SentimentTraining, is:
Exception in thread "main" java.lang.NumberFormatException: For input string: "DT"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.valueOf(Integer.java:766)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:37)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:33)
at edu.stanford.nlp.sentiment.SentimentUtils.attachLabels(SentimentUtils.java:33)
at edu.stanford.nlp.sentiment.SentimentUtils.readTreesWithLabels(SentimentUtils.java:69)
at edu.stanford.nlp.sentiment.SentimentUtils.readTreesWithGoldLabels(SentimentUtils.java:50)
at de.dkt.eservices.esentimentanalysis.modules.CoreNLPSentimentAnalyzer.trainModel(CoreNLPSentimentAnalyzer.java:251)
at de.dkt.eservices.esentimentanalysis.modules.CoreNLPSentimentAnalyzer.main(CoreNLPSentimentAnalyzer.java:306)
This makes sense, given that it expects a numeric sentiment label but gets the syntactic label of a node in the tree (here "DT").
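A small diagnostic along these lines can confirm the mismatch before a tree is handed to the training code. This is only a sketch under my own assumptions: `SentimentLabelCheck` and `nonNumericLabels` are hypothetical names, it works on the printed tree string rather than on CoreNLP `Tree` objects, and the second tree in `main` is made up for the example. It applies `Integer.parseInt` to each node label, which is the call that fails in the stack trace above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of CoreNLP.
class SentimentLabelCheck {
    // A node label is the token that directly follows an opening bracket.
    private static final Pattern NODE_LABEL = Pattern.compile("\\(([^\\s()]+)");

    /** Returns, in order, every node label that Integer.parseInt rejects. */
    public static List<String> nonNumericLabels(String tree) {
        List<String> bad = new ArrayList<>();
        Matcher m = NODE_LABEL.matcher(tree);
        while (m.find()) {
            try {
                Integer.parseInt(m.group(1));
            } catch (NumberFormatException e) {
                bad.add(m.group(1)); // same failure mode as in the stack trace
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(nonNumericLabels("(ROOT (NP (DT The) (NNS performances)))"));
        // prints: [ROOT, NP, DT, NNS]
        System.out.println(nonNumericLabels("(3 (2 The) (3 performances))"));
        // prints: []
    }
}
```

An empty list means every node label is an integer; anything else lists the labels that would trigger the NumberFormatException.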
Would be grateful for any pointers here!