
I've been using Stanford's CoreNLP to perform sentiment analysis on some data I have, and I'm now working on training my own model. I know a model can be trained with the following command:

java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

I know what goes in the train.txt file. You score sentences and put them in train.txt, something like this: (0 (2 Today) (0 (0 (2 is) (0 (2 a) (0 (0 bad) (2 day)))) (..)))

But I don't understand what goes in the dev.txt file. I read through this question several times to try to understand it, but it's still unclear to me. Also, scoring these sentences manually has become a pain. Is there a tool available that makes it easier? I'm worried that I've used the wrong number of parentheses or made some other silly mistake like that.

Also, any suggestions on how long my train.txt file should be? I'm thinking of scoring 1,000 sentences. Is that too few or too many?

All your help is appreciated :)

user3266259

2 Answers

  1. dev.txt should have the same format as train.txt, just with a different set of sentences. The same sentence should not appear in both dev.txt and train.txt. The development set is used to evaluate the quality of the model you train on the training data.

  2. We don't distribute a tool for tagging sentiment data. This class could be helpful in building data: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/sentiment/BuildBinarizedDataset.html

  3. Here are the sizes of the train, dev, and test sets used for the sentiment model: train=8544, dev=1101, test=2210
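Since there is no tagging tool, a small sanity check on hand-written trees may help catch parenthesis mistakes before training. The following is a minimal sketch, not part of CoreNLP: the class name TreeLineCheck is hypothetical, and it assumes one tree per line, labels in the range 0-4 immediately after each opening parenthesis, and no literal parentheses inside the words themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CoreNLP: sanity-checks one line of train.txt or dev.txt.
class TreeLineCheck {

    /** Returns the problems found in one tree line; an empty list means none were found. */
    static List<String> check(String line) {
        List<String> problems = new ArrayList<>();
        int depth = 0; // number of currently open parentheses
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '(') {
                depth++;
                // Assumption: every "(" is followed directly by a label 0-4 and a space.
                boolean labelOk = i + 2 < line.length()
                        && line.charAt(i + 1) >= '0' && line.charAt(i + 1) <= '4'
                        && line.charAt(i + 2) == ' ';
                if (!labelOk) {
                    problems.add("missing or invalid label after '(' at index " + i);
                }
            } else if (c == ')') {
                if (depth == 0) {
                    problems.add("unmatched ')' at index " + i);
                } else {
                    depth--;
                }
            }
        }
        if (depth > 0) {
            problems.add(depth + " unclosed '('");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Balanced and labeled, so no problems are reported.
        System.out.println(check("(2 (2 a) (0 (0 bad) (2 day)))"));
        // One closing parenthesis is missing.
        System.out.println(check("(2 (2 a) (0 bad)"));
    }
}
```

This only checks bracket balance and label placement. It does not verify that the tree is binarized or that the scores are sensible.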

StanfordNLPHelp
  • Can you elaborate on dev.txt? Right now, I'm using tweets for my train.txt file. Should I collect the same number of tweets, score them and then put them in my dev.txt file? – user3266259 Nov 15 '15 at 05:19
  • Also, once I've created my model, how can I test it? Is there a jar file in the CoreNLP library I downloaded that I can run on a sample test.txt file? I apologize for asking you so many questions all at once, but you seem to be the expert :D – user3266259 Nov 15 '15 at 05:23
  • I was in error in my answer. From the paper: The sentences in the treebank were split into a train (8544), dev (1101) and test splits (2210) – StanfordNLPHelp Nov 15 '15 at 10:02
  • Yes, dev.txt should be the same type of data, just with different examples. – StanfordNLPHelp Nov 15 '15 at 10:04
  • When you run SentimentTraining's main() method, it will report scores on the dev set. – StanfordNLPHelp Nov 15 '15 at 10:04
  • Why is the number of sentences in dev.txt a lot less than the number in train.txt? Shouldn't they be the same? If I have 500 tweets in train.txt, how many should I have in dev.txt? – user3266259 Nov 15 '15 at 22:08
  • You want to have as much training data as possible and a reasonable amount of evaluation data. It is typical in a project like this to have that type of size disparity. – StanfordNLPHelp Nov 16 '15 at 20:23
  • In a previous project, I saw someone use the same data in train.txt and dev.txt. From what you've been saying, this would be incorrect, but I'm wondering how much of an impact it would have. Would the model be completely useless in that case? @StanfordNLPHelp – user3266259 Nov 17 '15 at 06:30
  • I would recommend trying to recreate the training, test, dev split suggested above. Here is an article on test sets: https://en.wikipedia.org/wiki/Test_set – StanfordNLPHelp Nov 17 '15 at 06:38
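The train/dev split discussed in these comments could be sketched as follows. This is only an illustration under stated assumptions, not part of CoreNLP: the class TrainDevSplit and its parameters are hypothetical, each list element is one scored tree line, and duplicates are detected by exact string match only. In the treebank split quoted above, dev is roughly 11% of train plus dev (1101 of 9645), which can serve as a reference point rather than a rule.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of CoreNLP: splits scored tree lines into disjoint train and dev lists.
class TrainDevSplit {
    final List<String> train = new ArrayList<>();
    final List<String> dev = new ArrayList<>();

    static TrainDevSplit split(List<String> treeLines, double devFraction, long seed) {
        // Drop exact duplicates so the same line cannot land in both files.
        List<String> unique = new ArrayList<>(new LinkedHashSet<>(treeLines));
        // Shuffle with a fixed seed so the split is reproducible.
        Collections.shuffle(unique, new Random(seed));
        int devSize = (int) Math.round(unique.size() * devFraction);
        TrainDevSplit result = new TrainDevSplit();
        result.dev.addAll(unique.subList(0, devSize));
        result.train.addAll(unique.subList(devSize, unique.size()));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 500; i++) {
            lines.add("(2 tweet" + i + ")"); // placeholder trees for illustration only
        }
        TrainDevSplit s = split(lines, 0.11, 42L);
        System.out.println("train=" + s.train.size() + ", dev=" + s.dev.size());
    }
}
```

The two lists would then be written to train.txt and dev.txt. A sentence that was scored twice with different labels is not caught by the exact-match deduplication and would need a check on the raw text instead.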

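Here is some sample code for evaluating a model:

import java.util.List;
import edu.stanford.nlp.sentiment.Evaluate;
import edu.stanford.nlp.sentiment.SentimentModel;
import edu.stanford.nlp.sentiment.SentimentUtils;
import edu.stanford.nlp.trees.Tree;

// load a model (modelPath is the path to the trained model, e.g. model.ser.gz)
SentimentModel model = SentimentModel.loadSerialized(modelPath);

// load devTrees (devPath is the path to the labeled trees, e.g. dev.txt)
List<Tree> devTrees = SentimentUtils.readTreesWithGoldLabels(devPath);

// evaluate on devTrees and print a summary of the results
Evaluate eval = new Evaluate(model);
eval.eval(devTrees);
eval.printSummary();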

For the required imports and further usage details, see:

edu/stanford/nlp/sentiment/SentimentTraining.java

StanfordNLPHelp