
I am using the Stanford Parser (Version 3.6.0) for French. My command line is

java -cp stanford-parser.jar:* edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxlength 30 -outputFormat conll2007 frenchFactored.ser.gz test_french.txt > test_french.conll10

But I don't get the grammatical functions in the output; see:

1	Je	_	CLS	CLS	_	2	NULL	_	_
2	mange	_	V	V	_	0	root	_	_
3	des	_	P	P	_	2	NULL	_	_
4	pommes	_	N	N	_	3	NULL	_	_
5	.	_	PUNC	PUNC	_	2	NULL	_	_

What could I have missed in the command line?

Starckman

3 Answers


There's nothing wrong with your command:

Known formats are: oneline, penn, latexTree, xmlTree, words, wordsAndTags, rootSymbolOnly, dependencies, typedDependencies, typedDependenciesCollapsed, collocations, semanticGraph, conllStyleDependencies, conll2007. The last two are both tab-separated values formats. The latter has a lot more columns filled with underscores. [...]

Source: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/TreePrint.html

You can try another -outputFormat.
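For example, here is a sketch of the same invocation from the question with `typedDependencies` instead of `conll2007` (whether grammatical relations actually appear still depends on the model, as the comments discuss):

```shell
# Same invocation as in the question, with a different output format.
java -cp stanford-parser.jar:* edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -maxlength 30 -outputFormat typedDependencies \
  frenchFactored.ser.gz test_french.txt > test_french.deps
```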

mejem
  • Thanks. For the Chinese parser (xinhuaFactored.ser.gz) I get grammatical functions like nsubj, auxpass and so on, but with the French one, as you can see, I only get "NULL". Does that simply mean function annotations are not available in the Stanford Parser for French? – Starckman Mar 04 '16 at 15:18
  • It also works for English (I tried it just now), so it seems it is simply not implemented for French. Your command is fine, but the parser doesn't work as you expect. – mejem Mar 04 '16 at 16:22
  • OK, that's what I read here https://mailman.stanford.edu/pipermail/parser-user/2014-June/002937.html : "We don't (yet) have a direct dependency parser, instead parsing to constituencies first and then converting for English and Chinese. You would either need to convert parse trees for French dependencies in a similar manner or train and then use some other group's dependency parser. It's not impossible but it would be a ton of work." But since that is dated June 2014, I wasn't sure it was still the case. Thanks! – Starckman Mar 04 '16 at 16:26

There is a deep learning based French dependency parser in Stanford CoreNLP 3.6.0.

Download Stanford CoreNLP 3.6.0 here:

http://stanfordnlp.github.io/CoreNLP/download.html

Also make sure to get the French models jar, which is available on the same page.

Then run the following command to use the French dependency parser, making sure the French models jar is on your CLASSPATH:

java -Xmx6g -cp "*:stanford-corenlp-full-2015-12-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-french.properties -file sample-french-document.txt -outputFormat text
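If you only need dependencies, a variant of the command above restricted to the relevant annotators, with CoNLL output, might look like this (the annotator list and `-outputFormat conll` here are standard pipeline options, taken from the command the asker reports in the comments below, not from this answer):

```shell
# Sketch: run only the annotators needed for dependency parsing and
# print one CoNLL row (index, word, POS, head, relation) per token.
java -Xmx6g -cp "*:stanford-corenlp-full-2015-12-09/*" \
  edu.stanford.nlp.pipeline.StanfordCoreNLP \
  -props StanfordCoreNLP-french.properties \
  -annotators tokenize,ssplit,pos,depparse \
  -file sample-french-document.txt -outputFormat conll
```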
StanfordNLPHelp
  • Are those jar files present in the directory where you run this command? You're getting this error because for some reason the French models jar is not on your CLASSPATH. If you do a jar -tf on the French models jar, you will see that the tagger file is present. – StanfordNLPHelp Mar 20 '17 at 21:34
  • They are not in the directory (french.tagger; UD_French.gz), I am not able to find them anywhere, except frenchFactored.ser.gz. – Starckman Mar 21 '17 at 00:59
  • You need to download the french models jar files from here: http://stanfordnlp.github.io/CoreNLP/download.html – StanfordNLPHelp Mar 21 '17 at 01:04
  • I downloaded it : stanford-french-corenlp-2016-10-31-models.jar. And it is in my directory "stanford-corenlp-full-2016-10-31" – Starckman Mar 21 '17 at 01:07
  • A warning, the French dependency parser has an issue that it was trained on a different kind of part of speech tag than our French part of speech tagger. This GitHub issue discusses it in more detail: https://github.com/stanfordnlp/CoreNLP/issues/312 ... the user in that issue wrote some code to convert our tags if you look at this issue: https://github.com/askplatypus/CoreNLP/commit/e6215bdc5d4903bc3e2d2fb533da7e3938fa825f – StanfordNLPHelp Mar 21 '17 at 01:07
  • Just to clarify, these jars need to be on the CLASSPATH: stanford-corenlp-3.7.0.jar and stanford-french-corenlp-2016-10-31-models.jar. If you issue the command "jar -tf stanford-french-corenlp-2016-10-31-models.jar" you will see that "edu/stanford/nlp/models/pos-tagger/french/french.tagger" is present in that jar file. – StanfordNLPHelp Mar 21 '17 at 01:11
  • When I do "jar -tf stanford-french-corenlp-2016-10-31-models.jar" I get "java.util.zip.ZipException: error in opening zip file". – Starckman Mar 21 '17 at 01:17
  • I guess you could try downloading the file again. What operating system are you using? When I download the file from that link I have no problem with the jar file. – StanfordNLPHelp Mar 21 '17 at 01:19
  • OS X El Capitan Version 10.11.6. – Starckman Mar 21 '17 at 01:22
  • The command "jar tf stanford-french-corenlp-2016-10-31-models.jar" should work fine and show you a list of the resources in that jar. If you are getting an error, you should try downloading the file again, because that error suggests the file is damaged in some manner. – StanfordNLPHelp Mar 21 '17 at 01:24
  • The easiest thing would be to 1.) clone (or just download) this repo: https://github.com/askplatypus/CoreNLP ...then follow the instructions on that page for building stanford-corenlp.jar with ant. You need to replace the jar that we distribute with a modified jar you build from askplatypus's code. Specifically you need to replace stanford-corenlp-3.7.0.jar with the jar you build by following the instructions on his page. – StanfordNLPHelp Mar 21 '17 at 01:33
  • So 1.) git clone https://github.com/askplatypus/CoreNLP.git 2.) follow instructions called "build with Ant" – StanfordNLPHelp Mar 21 '17 at 01:35
  • Thanks for your priceless help, it works. Here is my command: java -mx2100m -cp stanford-ME_corenlp.jar:stanford-french-corenlp-2016-10-31-models.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-french.properties -annotators tokenize,ssplit,pos,depparse -file /Users/xx/xx/xx/xx/xx/file.txt -outputFormat conll. However, the options "-outputFormatOptions includePunctuationDependencies" and "-sentences newline" seem not to work; I don't find any mention of them in http://stanfordnlp.github.io/CoreNLP/cmdline.html – Starckman Mar 21 '17 at 12:46
  • To get 1 sentence per line, use this option "-ssplit.eolonly" which will say every line is a distinct sentence. – StanfordNLPHelp Mar 22 '17 at 00:41
  • When I run with the French (using the broken part-of-speech tags) I see a dependency edge for the punctuation in the CoNLL output format. I wanted to note that "-outputFormatOptions" is for a different Java class than the pipeline so the pipeline doesn't use that option. – StanfordNLPHelp Mar 22 '17 at 00:45
  • Please let me know if you think the French dependency parser with the fixed part-of-speech tags is working decently. This isn't something we've put a lot of time into developing but it would be interesting to know if it produces at least decent results vs. total nonsense. We are meaning to get around to training a UD based French part-of-speech tagger, but it might also be easier to use Thomas's code. – StanfordNLPHelp Mar 22 '17 at 00:54
  • -ssplit.eolonly works. The output is decent, but I found numerous sentences with several root links, as if root were used instead of dep. The parser often doesn't parse the main verb (tagged correctly) as the root of the sentence; in such cases it often makes the first word the root. My text belongs to the literary genre, and the non-recognition of MWEs often leads to parsing mistakes. The clitic object pronoun "l'" seems never to be parsed correctly, always being mistaken for the determiner "l'", even when the clitic is correctly tagged as PRONOUN... The treatment of noun phrase modifiers is satisfying. – Starckman Mar 22 '17 at 02:12
  • The -ssplit.eolonly option doesn't work with the Chinese parser from CoreNLP...If I used the Stanford Parser instead of CoreNLP, is there any difference? – Starckman Mar 22 '17 at 10:52

Your command is fine, but the Stanford Parser (version 3.6.0) doesn't support grammatical functions for French yet.

The following code prints "false" when using the French model. The command you are using checks this flag internally and quietly skips producing grammatical structures when it is false.

System.out.println(
  LexicalizedParser
    .loadModel("frenchFactored.ser.gz")
    .treebankLanguagePack()
    .supportsGrammaticalStructures()
);

That's why I'm using the Malt parser (http://www.maltparser.org/).

If you want the following output:

1   Je      Je      C   CLS     null    2   suj     _   _
2   mange   mange   V   V       null    0   root    _   _
3   des     des     P   P       null    2   mod     _   _
4   pommes  pommes  N   N       null    3   obj     _   _
5   .       .       P   PUNC    null    2   mod     _   _

Then use the following code, which generates it (you can't get this output from the command line alone). I'm using both Stanford and Malt to accomplish this:

LexicalizedParser lexParser = LexicalizedParser.loadModel("frenchFactored.ser.gz");
TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
ConcurrentMaltParserModel parserModel = ConcurrentMaltParserService.initializeParserModel(new File("fremalt-1.7.mco"));

Tokenizer<CoreLabel> tok = tokenizerFactory.getTokenizer(new StringReader("Je mange des pommes."));
List<CoreLabel> rawWords2 = tok.tokenize();
Tree parse = lexParser.apply(rawWords2);

// The Malt parser requires tokens in the MaltTab (CoNLL) format.
// Instead of using the Stanford tagger, we could have used MElt or another tagger.
String[] tokens = parse.taggedLabeledYield().stream()
    .map(word -> {
        CoreLabel w = (CoreLabel)word;
        String lemma = Morphology.lemmatizeStatic(new WordTag(w.word(), w.tag())).word();
        String tag = w.value();

        return String.join("\t", new String[]{
            String.valueOf(w.index()+1),
            w.word(),
            lemma != null ? lemma : w.word(), 
            tag != null ? String.valueOf(tag.charAt(0)) : "_",
            tag != null ? tag : "_"
        });
    })
    .toArray(String[]::new);

ConcurrentDependencyGraph graph = parserModel.parse(tokens);
System.out.println(graph);

From there, you can programmatically traverse the graph by using:

graph.nTokenNodes()
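As a sketch of that traversal (assuming the maltparser 1.8 concurrent API, where getTokenNode(i) returns the i-th token node; verify the method names against the javadoc for your version):

```java
// Assumes `graph` is the ConcurrentDependencyGraph obtained above.
// Token nodes are 1-indexed; index 0 is the artificial root node.
for (int i = 1; i <= graph.nTokenNodes(); i++) {
    ConcurrentDependencyNode node = graph.getTokenNode(i);
    // Each node renders as its row of the dependency table.
    System.out.println(node);
}
```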

If you use Maven, just add the following dependencies to your pom:

<dependency>
    <groupId>org.maltparser</groupId>
    <artifactId>maltparser</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.6.0</version>
</dependency>   

Bonus: the imports

import org.maltparser.concurrent.ConcurrentMaltParserModel;
import org.maltparser.concurrent.ConcurrentMaltParserService;
import org.maltparser.concurrent.graph.ConcurrentDependencyGraph;
import org.maltparser.concurrent.graph.ConcurrentDependencyNode;
import org.maltparser.core.exception.MaltChainedException;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.WordTag;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.Morphology;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.process.TokenizerFactory;
import edu.stanford.nlp.trees.Tree;

Extra: fremalt-1.7.mco file

http://www.maltparser.org/mco/french_parser/fremalt.html

antoine
  • Sorry, I hadn't been connected for a long time and didn't respond; thank you very much. I used the Mate parser for French, which I recommend: https://code.google.com/archive/p/mate-tools/downloads – Starckman Feb 23 '17 at 08:20