Why is Stanford CoreNLP gender identification nondeterministic?

Question

As the results below show, the name edward gets two different gender values (null and MALE). This happens with several other names as well.

edward, Gender: null
james, Gender: MALE
karla, Gender: null
edward, Gender: MALE

Additionally, how can I customize the gender dictionaries? I want to add Spanish and Chinese names.

How do you call CoreNLP (i.e.: How did you create the list in your post)? — SQL Police, Jul 07 '15 at 05:19
Could you provide more details about the input file and what commands you issued? Thanks! — StanfordNLPHelp, Jul 08 '15 at 19:46
loadStanfordCoreNLP();
Annotation document = new Annotation("edward james karla edward");
pipeline.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.print(token.value());
        System.out.print(", Gender: ");
        System.out.println(token.get(MachineReadingAnnotations.GenderAnnotation.class));
    }
}
— user3390236, Jul 09 '15 at 00:37

StanfordNLPHelp · Answer 1 · 2015-07-11T10:37:01.670

You have raised a lot of issues!

1.) Karla is not in the default gender mappings file, which is why it gets null.

2.) If you want to make your own custom file, it should be in this format:

JOHN\tMALE

There should be one NAME\tGENDER entry per line, where \t stands for a tab character.

The GenderAnnotator can only take one mappings file, so you need to make a new file that contains the default entries plus the names you want to add.
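Before pointing the pipeline at a hand-edited file, it can help to check that every line really has the NAME\tGENDER shape. Here is a minimal sketch of such a check. The class name GenderLineCheck is hypothetical and not part of CoreNLP, and it assumes the only labels are MALE and FEMALE (the example above only shows MALE, so adjust if your file differs).

```java
class GenderLineCheck {
    // Returns true when the line is exactly NAME<tab>GENDER.
    // Assumption: the allowed labels are MALE and FEMALE.
    static boolean isValidLine(String line) {
        String[] parts = line.split("\t", -1);
        return parts.length == 2
                && !parts[0].trim().isEmpty()
                && (parts[1].equals("MALE") || parts[1].equals("FEMALE"));
    }

    public static void main(String[] args) {
        String[] samples = {
            "JOHN\tMALE",     // well formed
            "KARLA\tFEMALE",  // well formed
            "KARLA FEMALE"    // space instead of tab, rejected
        };
        for (String line : samples) {
            System.out.println(isValidLine(line) + " <- " + line.replace("\t", "\\t"));
        }
    }
}
```

A common mistake is an editor that silently turns tabs into spaces, which is the case the third sample catches.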

The default gender mappings file is in the stanford-corenlp-3.5.2-models.jar file.

You can extract the default gender mappings file from that jar in this manner:

mkdir tmp-stanford-models-expanded
cp /path/of/stanford-corenlp-3.5.2-models.jar tmp-stanford-models-expanded
cd tmp-stanford-models-expanded
jar xf stanford-corenlp-3.5.2-models.jar

There should now be a directory tmp-stanford-models-expanded/edu.
The file you want is tmp-stanford-models-expanded/edu/stanford/nlp/models/gender/first_name_map_small
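Since the annotator takes a single mappings file, one way to keep the default names and add your own is to merge the extracted file with a file of additions. The following is a rough sketch only, not something shipped with CoreNLP. The class name MergeGenderDictionaries and the three file arguments are hypothetical, and it assumes both inputs are UTF-8 text in the NAME\tGENDER format (UTF-8 matters if you add Chinese or accented Spanish names).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeGenderDictionaries {
    // Merges several lists of NAME<tab>GENDER lines into one.
    // If a name appears more than once, the entry read last wins.
    // Lines that are not exactly NAME<tab>GENDER are skipped.
    static List<String> merge(List<List<String>> sources) {
        Map<String, String> genderByName = new LinkedHashMap<>();
        for (List<String> lines : sources) {
            for (String line : lines) {
                String[] parts = line.split("\t", -1);
                if (parts.length != 2 || parts[0].isEmpty()) {
                    continue; // malformed line
                }
                genderByName.put(parts[0], parts[1]);
            }
        }
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, String> entry : genderByName.entrySet()) {
            merged.add(entry.getKey() + "\t" + entry.getValue());
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: MergeGenderDictionaries <default-file> <additions-file> <output-file>");
            return;
        }
        List<List<String>> sources = new ArrayList<>();
        sources.add(Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
        sources.add(Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8));
        Files.write(Paths.get(args[2]), merge(sources), StandardCharsets.UTF_8);
    }
}
```

After compiling, you could run it with the extracted first_name_map_small as the first argument, your own additions as the second, and the output path as the third. The sketch does not change the case of names, so write your additions in the same style as the entries in the default file.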

3.) Build your pipeline in this manner to use your custom gender dictionary:

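import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

Properties props = new Properties();
// gender is listed before ner on purpose (see point 4)
props.setProperty("annotators",
    "tokenize, ssplit, pos, lemma, gender, ner");
// path to your custom NAME\tGENDER file
props.setProperty("gender.firstnames", "/path/to/your/gender_dictionary.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);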

4.) Try running gender BEFORE ner in your pipeline (see my ordering of the annotators above). The RegexNERSequenceClassifier, which is the class that adds the gender tags, can get blocked if tokens already have NER tags. It looks to me like running the gender annotator first will fix the problem.
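If the annotator list is assembled from configuration, a small guard can catch the wrong order before the pipeline is built. This is only a sketch that inspects the property string. The class name AnnotatorOrderCheck is hypothetical, it does not call CoreNLP at all, and it assumes annotator names are separated by commas and/or whitespace as in the example above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class AnnotatorOrderCheck {
    // True when "gender" is listed and, if "ner" is also listed, comes before it.
    static boolean genderBeforeNer(Properties props) {
        String value = props.getProperty("annotators", "");
        List<String> names = Arrays.asList(value.trim().split("[,\\s]+"));
        int gender = names.indexOf("gender");
        int ner = names.indexOf("ner");
        return gender >= 0 && (ner < 0 || gender < ner);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, gender, ner");
        System.out.println("gender before ner: " + genderBeforeNer(props));
    }
}
```

You would call genderBeforeNer(props) just before constructing the pipeline and fail fast if it returns false.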

The sequence "edward james karla edward" is tagged "O O PERSON PERSON" by the NER tagger. I am not entirely sure why the first two tokens get "O" for their NER tags. Note that "Edward James Karla Edward" yields "PERSON PERSON PERSON PERSON". Keep in mind that the NER tagger factors in position in the sentence, so perhaps being lowercased at the beginning of the sentence is what causes the first token "edward" to be marked as "O".

If you have any issues with this, please let me know and I will be happy to help more!

TL;DR

1.) Karla gets null because that name is not in the default gender dictionary.

2.) You can make your own gender mappings file with one NAME\tGENDER entry per line. Make sure the property "gender.firstnames" is set to the path of your new gender mappings file.

3.) Make sure the gender annotator goes before the ner annotator. This should fix the problem!
