
I'm working on a project about dependency parsing for Polish. We're trying to train the Stanford Neural Network Dependency Parser on Polish data (Universal Dependencies treebanks in .conllu format). The data is already tokenized and annotated, so we have trained neither the tokenizer nor the parser models provided by CoreNLP. So far we've had some success with the pl_lfg-ud treebank on standard dependencies by running the parser from the command line. However, we would also like to train the parser to reproduce the enhanced Universal Dependencies, which are represented in the treebank as well. I have not been able to find a way to do this in the documentation or FAQ for either nndep or CoreNLP, even though, as far as I understand, it is possible with the Stanford NLP parser. Does enhanced dependency parsing work only for English (and other officially supported languages), or am I simply doing something wrong?
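For reference, the enhanced annotations we want to reproduce live in the DEPS (9th) column of the .conllu files, alongside the basic tree in the HEAD/DEPREL (7th/8th) columns. A minimal sketch of the distinction (the sentence and analyses are invented, not taken from pl_lfg-ud):

```python
# Two CoNLL-U token lines for "Kot śpi" ("The cat sleeps"), invented
# for illustration; columns are tab-separated.
SAMPLE = "\n".join([
    "\t".join("1 Kot kot NOUN _ _ 2 nsubj 2:nsubj _".split()),
    "\t".join("2 śpi spać VERB _ _ 0 root 0:root _".split()),
])

def basic_and_enhanced(conllu):
    """Split a CoNLL-U fragment into basic-tree arcs and enhanced-graph arcs."""
    basic, enhanced = [], []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # cols[6] = HEAD, cols[7] = DEPREL: the basic dependency tree.
        basic.append((cols[0], cols[6], cols[7]))
        # cols[8] = DEPS: the enhanced graph, as head:rel pairs joined by "|".
        for arc in cols[8].split("|"):
            head, _, rel = arc.partition(":")
            enhanced.append((cols[0], head, rel))
    return basic, enhanced

b, e = basic_and_enhanced(SAMPLE)
print(b)  # [('1', '2', 'nsubj'), ('2', '0', 'root')]
print(e)  # [('1', '2', 'nsubj'), ('2', '0', 'root')]
```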

I'll be very grateful for any clues!

1 Answer


There is information here about how to train a dependency parser model:

https://stanfordnlp.github.io/CoreNLP/depparse.html

example command:

java -Xmx12g edu.stanford.nlp.parser.nndep.DependencyParser -trainFile fr-ud-train.conllu -devFile fr-ud-dev.conllu -model new-french-UD-model.txt.gz -embedFile wiki.fr.vec -embeddingSize 300 -tlp edu.stanford.nlp.trees.international.french.FrenchTreebankLanguagePack -cPOS
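Adapting that command to Polish might look like the sketch below, assembled here as an argument list you could pass to subprocess. The file names (pl_lfg-ud-*.conllu, wiki.pl.vec) are assumptions to replace with your actual paths; CoreNLP ships no Polish TreebankLanguagePack, so the -tlp flag is omitted, and -cPOS keeps the universal coarse tags. Note that, as far as I can tell, nndep trains on the basic HEAD/DEPREL columns of the .conllu file, so this produces basic rather than enhanced dependencies.

```python
# Sketch: nndep training command for Polish. All file names are
# placeholders; substitute your own treebank and embedding paths.
cmd = [
    "java", "-Xmx12g", "edu.stanford.nlp.parser.nndep.DependencyParser",
    "-trainFile", "pl_lfg-ud-train.conllu",   # hypothetical path
    "-devFile", "pl_lfg-ud-dev.conllu",       # hypothetical path
    "-model", "new-polish-UD-model.txt.gz",
    "-embedFile", "wiki.pl.vec",              # e.g. fastText Polish vectors
    "-embeddingSize", "300",
    "-cPOS",                                  # use coarse (universal) POS tags
]
print(" ".join(cmd))
```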

You will need to train a part-of-speech model as well:

https://nlp.stanford.edu/software/pos-tagger-faq.html

https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/tagger/maxent/MaxentTagger.html

example command:

java -mx1g edu.stanford.nlp.tagger.maxent.MaxentTagger -props myPropertiesFile.props 

You can find the appropriate format for the training file in the documentation.

Example file:


## tagger training invoked at Sun Sep 23 19:24:37 PST 2018 with arguments:
                   model = english-left3words-distsim.tagger
                    arch = left3words,naacl2003unknowns,wordshapes(-1,1),distsim(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1),distsimconjunction(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1)
            wordFunction = edu.stanford.nlp.process.AmericanizeFunction
               trainFile = /path/to/training-data
         closedClassTags = 
 closedClassTagThreshold = 40
 curWordMinFeatureThresh = 2
                   debug = false
             debugPrefix = 
            tagSeparator = _
                encoding = UTF-8
              iterations = 100
                    lang = english
    learnClosedClassTags = false
        minFeatureThresh = 2
           openClassTags = 
rareWordMinFeatureThresh = 10
          rareWordThresh = 5
                  search = owlqn
                    sgml = false
            sigmaSquared = 0.0
                   regL1 = 0.75
               tagInside = 
                tokenize = true
        tokenizerFactory = 
        tokenizerOptions = 
                 verbose = false
          verboseResults = true
    veryCommonWordThresh = 250
                xmlInput = 
              outputFile = 
            outputFormat = slashTags
     outputFormatOptions = 
                nthreads = 1
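The dump above is from an English model, which uses distsim clusters and an Americanize word function that won't apply to Polish. A minimal properties file for a Polish tagger might be sketched as follows; the file names, the arch choice, and the TSV column mapping are assumptions to adapt (check the tagger FAQ for the exact trainFile format options):

```python
# Sketch: write a minimal .props file for training a Polish MaxentTagger.
# All values are illustrative placeholders, not a tested configuration.
props = {
    "model": "pl-ud.tagger",                              # hypothetical output name
    "arch": "left3words,naacl2003unknowns,wordshapes(-1,1)",
    # TSV columns: 1 = word form, 3 = UPOS in CoNLL-U; path is a placeholder.
    "trainFile": "format=TSV,wordColumn=1,tagColumn=3,pl_lfg-ud-train.conllu",
    "encoding": "UTF-8",
    "tagSeparator": "_",
}
with open("pl-tagger.props", "w", encoding="utf-8") as f:
    for key, value in props.items():
        f.write(f"{key} = {value}\n")
print(open("pl-tagger.props", encoding="utf-8").read())
```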

There is an exhaustive list of example training properties files here:

https://github.com/stanfordnlp/CoreNLP/tree/master/scripts/pos-tagger

If you use the Java pipeline, you'll need to write a tokenizer or provide pre-tokenized text.

You might be interested in our Python project, which has a Polish model for tokenizing, sentence splitting, lemmatizing, and dependency parsing. You can also train your own models:

https://github.com/stanfordnlp/stanfordnlp

StanfordNLPHelp