
I am new to NLP, and I used the Stanford NER tool to classify some random text in order to extract special keywords used in software programming.

The problem is that I don't know how to change the classifiers and text annotators in Stanford NER so that they recognize software programming keywords. For example:

today Java used in different operating systems (Windows, Linux, ..)

The classification results should look like this:

Java "Programming_Language"
Windows "Operating_System"
Linux "Operating_System"

Could you please explain how to customize the Stanford NER classifiers to satisfy my needs?

Frakcool
Tech

2 Answers


I think this is quite well documented in the Stanford NER FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a.

Here are the steps:

  • In your properties file, change the map to describe how your training data is annotated (that is, which column holds what):

map = word=0,myfeature=1,answer=2

  • In src\edu\stanford\nlp\sequences\SeqClassifierFlags.java

    Add a flag stating that you want to use your new feature; let's call it useMyFeature. Below public boolean useLabelSource = false; add public boolean useMyFeature = true;

    In the same file, in the setProperties(Properties props, boolean printProps) method, after else if (key.equalsIgnoreCase("useTrainLexicon")) { ..}, tell the tool whether this flag is on or off:

    else if (key.equalsIgnoreCase("useMyFeature")) {
          useMyFeature= Boolean.parseBoolean(val);
    }
    
  • In src/edu/stanford/nlp/ling/CoreAnnotations.java, add the following class:

    public static class myfeature implements CoreAnnotation<String> {
      public Class<String> getType() {
        return String.class;
      }
    }
    
  • In src/edu/stanford/nlp/ling/AnnotationLookup.java, at the bottom of public enum KeyLookup {..}, add

    MY_TAG(CoreAnnotations.myfeature.class,"myfeature")

  • In src\edu\stanford\nlp\ie\NERFeatureFactory.java, add the feature to the method that matches its "type". For example, in

    protected Collection<String> featuresC(PaddedList<IN> cInfo, int loc)

    add the following, guarded by the flag defined earlier (useMyFeature):

    if (flags.useMyFeature) {
        featuresC.add(c.get(CoreAnnotations.myfeature.class) + "-my_tag");
    }
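
To make the map setting from the first step more concrete, here is a minimal sketch in plain Java (it does not use Stanford NER at all) of how a spec such as word=0,myfeature=1,answer=2 assigns a name to each tab-separated column of a training line. The class name ColumnMapSketch, its methods, and the sample line with NNP as the extra feature value are all hypothetical and only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ColumnMapSketch {

    /** Parses a spec such as "word=0,myfeature=1,answer=2" into name -> column index. */
    static Map<String, Integer> parseMap(String spec) {
        Map<String, Integer> columns = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String[] kv = entry.trim().split("=");
            columns.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
        }
        return columns;
    }

    /** Splits one tab-separated training line and names each field by its column. */
    static Map<String, String> readLine(String line, Map<String, Integer> columns) {
        String[] fields = line.split("\t");
        Map<String, String> token = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : columns.entrySet()) {
            token.put(e.getKey(), fields[e.getValue()]);
        }
        return token;
    }

    public static void main(String[] args) {
        Map<String, Integer> columns = parseMap("word=0,myfeature=1,answer=2");
        // Hypothetical training line: token, extra feature value, gold label.
        System.out.println(readLine("Java\tNNP\tProgramming_Language", columns));
    }
}
```

In other words, with this map every training line needs three columns: the word, the value of your extra feature, and the label the classifier should learn.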
    

Debugging: In addition to this, there are methods that dump the features to a file; use them to see what is happening under the hood. I think you will also have to spend some time with the debugger :P

John Wiseman
Rahul
  • I have a question about "useMyFeature": does it refer to the feature of the annotated word? For example, in Java "Prog_Language", is "Prog_Language" the feature of the word Java? – Tech Apr 29 '14 at 22:03
  • @user3247440 "useMyFeature" is a **flag** that says whether or not to use your feature for training. If you turn it on, the tool will use the corresponding feature. – Rahul Apr 29 '14 at 23:36
  • Great. What about "-my_tag", what should I put in it? Some of the terminology confuses me because I'm not an expert in the NLP field. – Tech Apr 30 '14 at 02:10
  • Could you please help? Also, I could not find the main function to run the modified classes. – Tech Apr 30 '14 at 03:36
  • @user3247440 Read this link: https://mailman.stanford.edu/pipermail/java-nlp-user/2011-December/001567.html. Everything will be crystal clear. You can run from here: `src/edu/stanford/nlp/ie/crf/CRFClassifier.java`. I would suggest running an existing CRF model in debug mode to see what is happening. – Rahul May 01 '14 at 23:00
  • Thank you very much, I got an error on if(flags.useRahulPOSTAGS){ ... } – Tech May 03 '14 at 14:59

It seems you want to train a custom NER model.

Here is a detailed tutorial with full code:

https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so

Training data format

Training data is passed as a text file in which each line is one word-label pair in the format "word\tLABEL", that is, the word and the label name separated by a tab ('\t'). Each sentence has to be broken down into words, with one line per word in the training file. To mark the start of the next sentence, add an empty line to the training file.

Here is a sample of the input training file:

hp  Brand
spectre ModelName
x360    ModelName

home    Category
theater Category
system  0

horizon ModelName
zero    ModelName
dawn    ModelName
ps4 0
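
As a rough illustration of the format described above, the following plain-Java sketch turns already tokenized and labeled sentences into training-file text: one word, a tab, and a label per line, with an empty line after each sentence. The class name TrainingFileSketch and the array-based input layout are assumptions made for this example; the labels are copied from the sample above.

```java
class TrainingFileSketch {

    /**
     * Builds training-file text from sentences.
     * sentences[i][j] is a {word, label} pair for token j of sentence i.
     */
    static String toTrainingText(String[][][] sentences) {
        StringBuilder sb = new StringBuilder();
        for (String[][] sentence : sentences) {
            for (String[] pair : sentence) {
                // One token per line: word, a tab character, then the label.
                sb.append(pair[0]).append('\t').append(pair[1]).append('\n');
            }
            // An empty line marks the end of the sentence.
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[][][] sentences = {
            { {"hp", "Brand"}, {"spectre", "ModelName"}, {"x360", "ModelName"} },
            { {"home", "Category"}, {"theater", "Category"}, {"system", "0"} }
        };
        System.out.print(toTrainingText(sentences));
    }
}
```

Writing the returned string to a file gives something that can be used as the training file, provided the column map in the properties file matches this two-column layout.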

Depending on your domain, you can build such a dataset either automatically or manually. Building it manually can be really painful; a NER annotation tool can make the process much easier.

Train model

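import java.util.Properties;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.sequences.SeqClassifierFlags;
import edu.stanford.nlp.util.StringUtils;

public void trainAndWrite(String modelOutPath, String prop, String trainingFilepath) {
   // Load the training configuration from the properties file at path 'prop'.
   Properties props = StringUtils.propFileToProperties(prop);
   props.setProperty("serializeTo", modelOutPath);

   // If a training file is passed in, use it; otherwise use the one from the properties file.
   if (trainingFilepath != null) {
       props.setProperty("trainFile", trainingFilepath);
   }

   SeqClassifierFlags flags = new SeqClassifierFlags(props);
   CRFClassifier<CoreLabel> crf = new CRFClassifier<>(flags);
   crf.train();

   // Write the trained model to disk.
   crf.serializeClassifier(modelOutPath);
}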
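
The prop argument above is the path to a properties file. As a hedged sketch, the plain-Java snippet below writes one using only java.util.Properties. The flag names and values are the ones I recall from the example properties file in the Stanford NER FAQ, so please check them against the FAQ for your version; the class name NerPropsSketch and the file names train.tsv, ner-model.ser.gz and ner.prop are placeholders.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

class NerPropsSketch {

    /** Collects training settings; flag names are assumed from the Stanford NER FAQ example. */
    static Properties buildProps(String trainFile, String modelPath) {
        Properties p = new Properties();
        p.setProperty("trainFile", trainFile);
        p.setProperty("serializeTo", modelPath);
        // Column 0 holds the word, column 1 the label (the two-column format above).
        p.setProperty("map", "word=0,answer=1");
        p.setProperty("useClassFeature", "true");
        p.setProperty("useWord", "true");
        p.setProperty("useNGrams", "true");
        p.setProperty("noMidNGrams", "true");
        p.setProperty("maxNGramLeng", "6");
        p.setProperty("usePrev", "true");
        p.setProperty("useNext", "true");
        p.setProperty("useSequences", "true");
        p.setProperty("usePrevSequences", "true");
        p.setProperty("maxLeft", "1");
        p.setProperty("useTypeSeqs", "true");
        p.setProperty("useTypeSeqs2", "true");
        p.setProperty("useTypeSequences", "true");
        p.setProperty("wordShape", "chris2useLC");
        p.setProperty("useDisjunctive", "true");
        return p;
    }

    public static void main(String[] args) throws IOException {
        Properties p = buildProps("train.tsv", "ner-model.ser.gz");
        // store() escapes '=' inside values; loading with java.util.Properties reverses that.
        try (Writer w = Files.newBufferedWriter(Paths.get("ner.prop"))) {
            p.store(w, "Stanford NER training settings (sketch)");
        }
    }
}
```

The resulting ner.prop path could then be passed as the prop argument of trainAndWrite, with trainingFilepath overriding trainFile if it is not null.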

Use the model to generate tags:

public void doTagging(CRFClassifier<CoreLabel> model, String input) {
    input = input.trim();
    // Print the input followed by the tagged text returned by the model.
    System.out.println(input + "=>"  +  model.classifyToString(input));
}
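
If you need the tags as data rather than as a printed string, here is a small plain-Java sketch that splits tagged text into word and label pairs. It assumes classifyToString returns tokens in the word/LABEL form (as far as I know that is its default output style, but please verify with your version); the class name SlashTagsSketch and the sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SlashTagsSketch {

    /** Splits text such as "hp/Brand spectre/ModelName" into {word, label} pairs. */
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Use the last slash so that words containing '/' keep their full text.
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // skip anything that is not in word/LABEL form
            }
            pairs.add(new String[] { token.substring(0, slash), token.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical tagger output, not produced by a real model.
        for (String[] pair : parse("hp/Brand spectre/ModelName x360/ModelName")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```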

Hope this helps.

user439521