The CoreNLP documentation explains how to train a dependency parser model:
https://stanfordnlp.github.io/CoreNLP/depparse.html
Example command:
java -Xmx12g edu.stanford.nlp.parser.nndep.DependencyParser -trainFile fr-ud-train.conllu -devFile fr-ud-dev.conllu -model new-french-UD-model.txt.gz -embedFile wiki.fr.vec -embeddingSize 300 -tlp edu.stanford.nlp.trees.international.french.FrenchTreebankLanguagePack -cPOS
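Before starting a long training run, it can help to sanity-check the files passed as -trainFile and -devFile. The following is only a minimal sketch, assuming the files are standard CoNLL-U (10 tab-separated columns per token line, `#` comment lines, a blank line between sentences); `check_conllu` is a hypothetical helper name, not part of CoreNLP.

```python
def check_conllu(path):
    """Return (sentence_count, problems) for a CoNLL-U file.

    problems is a list of (line_number, field_count) pairs for token
    lines that do not have exactly 10 tab-separated fields.
    """
    sentences = 0
    problems = []
    in_sentence = False
    with open(path, encoding="utf-8") as handle:
        for line_no, raw in enumerate(handle, start=1):
            line = raw.rstrip("\n")
            if not line.strip():
                # A blank line closes the current sentence, if any.
                if in_sentence:
                    sentences += 1
                    in_sentence = False
                continue
            if line.startswith("#"):
                # Sentence-level comment such as "# text = ...".
                continue
            fields = line.split("\t")
            if len(fields) != 10:
                problems.append((line_no, len(fields)))
            in_sentence = True
    # Count a final sentence that is not followed by a blank line.
    if in_sentence:
        sentences += 1
    return sentences, problems
```

For example, `check_conllu("fr-ud-train.conllu")` returns the number of sentences and any malformed lines, so an empty problem list is a reasonable precondition before launching the parser training.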
You will need to train a part-of-speech model as well:
https://nlp.stanford.edu/software/pos-tagger-faq.html
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/tagger/maxent/MaxentTagger.html
Example command:
java -mx1g edu.stanford.nlp.tagger.maxent.MaxentTagger -props myPropertiesFile.props
The documentation describes the format of the properties file used for training.
Example properties file:
## tagger training invoked at Sun Sep 23 19:24:37 PST 2018 with arguments:
model = english-left3words-distsim.tagger
arch = left3words,naacl2003unknowns,wordshapes(-1,1),distsim(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1),distsimconjunction(/u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters,-1,1)
wordFunction = edu.stanford.nlp.process.AmericanizeFunction
trainFile = /path/to/training-data
closedClassTags =
closedClassTagThreshold = 40
curWordMinFeatureThresh = 2
debug = false
debugPrefix =
tagSeparator = _
encoding = UTF-8
iterations = 100
lang = english
learnClosedClassTags = false
minFeatureThresh = 2
openClassTags =
rareWordMinFeatureThresh = 10
rareWordThresh = 5
search = owlqn
sgml = false
sigmaSquared = 0.0
regL1 = 0.75
tagInside =
tokenize = true
tokenizerFactory =
tokenizerOptions =
verbose = false
verboseResults = true
veryCommonWordThresh = 250
xmlInput =
outputFile =
outputFormat = slashTags
outputFormatOptions =
nthreads = 1
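With tagSeparator = _ as in the file above, the tagger can read training data written as word_TAG tokens, one sentence per line. If your tagged data is in CoNLL-U (as for the parser), a conversion could look like the sketch below. It makes several assumptions: it takes the tag from the UPOS column (index 3), it skips multiword-token ranges such as "1-2" and keeps the syntactic words instead, and it removes spaces inside a token. `conllu_to_tagged_lines` is a hypothetical helper, and whether you want surface tokens or syntactic words depends on what your tokenizer produces at run time.

```python
def conllu_to_tagged_lines(conllu_text, separator="_", tag_column=3):
    """Yield one 'word_TAG word_TAG ...' line per CoNLL-U sentence."""
    tokens = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # Blank line: emit the sentence collected so far.
            if tokens:
                yield " ".join(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not fields[0].isdigit():
            continue
        # Assumption: drop spaces inside a token so that the output
        # stays whitespace-delimited (an arbitrary choice).
        word = fields[1].replace(" ", "")
        tokens.append(f"{word}{separator}{fields[tag_column]}")
    # Emit a final sentence that is not followed by a blank line.
    if tokens:
        yield " ".join(tokens)
```

Writing the yielded lines to a file and pointing trainFile at it gives the tagger data in the layout the properties above expect.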
A full set of example training properties files is available here:
https://github.com/stanfordnlp/CoreNLP/tree/master/scripts/pos-tagger
If you use the Java pipeline, you'll need to write a tokenizer or provide pre-tokenized text.
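For the pre-tokenized route, CoreNLP has the tokenize.whitespace and ssplit.eolonly properties, which make the pipeline split tokens on whitespace and sentences on newlines. The sketch below prepares text in that shape. The regular expression is a deliberately crude, hypothetical tokenization rule (a run of word characters, or a single punctuation mark), and it assumes sentence splitting has already been done elsewhere; a real tokenizer for your language will need more care.

```python
import re

# Hypothetical rule: a token is a run of word characters
# or a single non-space, non-word character (punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def pretokenize(sentences):
    """Return text with one sentence per line and tokens separated by spaces."""
    return "\n".join(" ".join(TOKEN_PATTERN.findall(s)) for s in sentences)
```

The returned string can be written to a file and passed to the pipeline with tokenize.whitespace = true and ssplit.eolonly = true, so that CoreNLP keeps your token and sentence boundaries instead of applying its own.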
You might be interested in our Python project, which has a Polish model for tokenization, sentence splitting, lemmatization, and dependency parsing. You can also train your own model with it:
https://github.com/stanfordnlp/stanfordnlp