Where in the CoreNLP code are the Penn Treebank part-of-speech symbols themselves actually represented?

Question

I'm looking specifically for some data structure, enum, or generative process through which the different parts of speech are represented internally. I've spent a long time scanning the Javadoc and the source code and can't find what I'm looking for. If the tags are stored in some central location, I would like to access a collection of them directly. Please forgive me if my question rests on a naive assumption about how CoreNLP POS tagging operates, but if what I'm describing does exist in some form, it would be very helpful. Thanks!

score 1 · Accepted Answer · answered Mar 26 '17 at 18:45


I'm not actually sure they're represented explicitly anywhere in the code. The tagger simply outputs them as Strings rather than any sort of fixed enum, and the output space is inferred directly from the training data. The advantage is that you can train the exact same model on arbitrary tag sets. The disadvantage, of course, is the one you've just run into. :)

However, for English, the tag set should be the Penn Treebank tag set: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
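To illustrate what "inferred directly from the training data" means in practice, here is a minimal sketch that collects the distinct tags seen in tagged text. It is not CoreNLP code: the class `TagInventory`, the method `collectTags`, and the `word_TAG` token format with `_` as the separator are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of CoreNLP: builds a tag inventory from
// tagged sentences, the way a data-driven tagger's label space arises.
final class TagInventory {

    /** Collects the distinct tags from tokens shaped like word<sep>TAG. */
    static Set<String> collectTags(List<String> sentences, char sep) {
        Set<String> tags = new TreeSet<>();
        for (String sentence : sentences) {
            for (String token : sentence.trim().split("\\s+")) {
                // Split on the last separator so words containing it still work.
                int i = token.lastIndexOf(sep);
                if (i > 0 && i < token.length() - 1) {
                    tags.add(token.substring(i + 1));
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Toy data; the '_' separator is an assumption for this example.
        List<String> toy = Arrays.asList("The_DT dog_NN barks_VBZ", "Dogs_NNS bark_VBP");
        System.out.println(collectTags(toy, '_'));
    }
}
```

Whatever tags appear in the data become the label space, which is why no fixed enum is needed for the tagger to work.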


Gabor Angeli


Thanks for the answer. Yeah, that's what I figured might be the case. I am aware that it uses PTB (and from your answer, I now understand why the training mechanics don't conform to a specific treebank model). It still would be nice for my current project if I could access the values somehow, even programmatically, rather than having to trust my own ability to copy the values by hand into my own enum for comparison. We programmers generally hate doing things like this by hand instead of automating them, so you can understand where this is coming from. Oh, well. Thanks again for the help. :) – David Kriz Mar 26 '17 at 21:23

So, you can try to take a look at `AbstractSequenceClassifier#labels()`, which will give you the sequence model's view of the label space. But, (1) this doesn't necessarily have to be correct (e.g., it could in theory have more labels than are in the training set), and (2) it's a pain to get at from the actual pipeline. I'd recommend just hard-coding the tags yourself into an enum. A lot of things change over time in CoreNLP, but the POS tag set is not likely to be one of them – Gabor Angeli Mar 26 '17 at 22:05
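A minimal sketch of the hard-coded enum suggested above, assuming the 36 word-level Penn Treebank tags from the linked list. The names `PennTag`, `label()`, and `fromLabel` are made up for this example and are not part of CoreNLP. Punctuation and bracket tags are left out, so add them if your tagger emits them.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical enum, not part of CoreNLP: the 36 word-level PTB POS tags.
enum PennTag {
    CC("CC"), CD("CD"), DT("DT"), EX("EX"), FW("FW"), IN("IN"),
    JJ("JJ"), JJR("JJR"), JJS("JJS"), LS("LS"), MD("MD"),
    NN("NN"), NNS("NNS"), NNP("NNP"), NNPS("NNPS"), PDT("PDT"), POS("POS"),
    PRP("PRP"), PRP_POSS("PRP$"), // "$" is avoided in the constant name
    RB("RB"), RBR("RBR"), RBS("RBS"), RP("RP"), SYM("SYM"), TO("TO"), UH("UH"),
    VB("VB"), VBD("VBD"), VBG("VBG"), VBN("VBN"), VBP("VBP"), VBZ("VBZ"),
    WDT("WDT"), WP("WP"), WP_POSS("WP$"), WRB("WRB");

    // Lookup table from the tagger's String output to the enum constant.
    private static final Map<String, PennTag> BY_LABEL;
    static {
        Map<String, PennTag> m = new HashMap<>();
        for (PennTag t : values()) {
            m.put(t.label, t);
        }
        BY_LABEL = Collections.unmodifiableMap(m);
    }

    private final String label;

    PennTag(String label) {
        this.label = label;
    }

    /** The tag exactly as it appears in tagger output, e.g. "PRP$". */
    public String label() {
        return label;
    }

    /** Empty if the String is not one of the tags listed here. */
    public static Optional<PennTag> fromLabel(String label) {
        return Optional.ofNullable(BY_LABEL.get(label));
    }
}
```

Returning an `Optional` rather than throwing keeps unknown labels (punctuation tags, or tags from a model trained on another tag set) from crashing the caller, e.g. `PennTag.fromLabel(tag).ifPresent(...)`.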

score 0 · Answer 2 · answered Nov 02 '22 at 14:55


I've found that this file seems to describe the tags in the most detail:

https://github.com/stanfordnlp/CoreNLP/blob/main/data/edu/stanford/nlp/upos/ENUniversalPOS.tsurgeon

However, many of the HeadFinder files contain POS tags as well:

https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/trees/CollinsHeadFinder.java


Ryan Schumacher


http://surdeanu.cs.arizona.edu/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html – Ryan Schumacher Nov 06 '22 at 19:40
