Editing the NLTK Corpus

Question

In addition to the corpus that comes with NLTK, I want to train the tagger on my own corpus, which follows the same part-of-speech rules. How can I find the corpus that it is using, and how can I add my own corpus (in addition, not as a replacement)?

EDIT: Here is the code that I am currently using:

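import nltk

inpy = input("$")  # was raw_input() under Python 2
text = nltk.word_tokenize(inpy)  # split the input into tokens
d = nltk.pos_tag(text)  # tag each token with NLTK's default tagger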

score 1 · Accepted Answer · answered Mar 11 '15 at 20:30

NLTK comes with a substantial number of different corpora. It would help if you specified in more detail which corpus you want to augment. The main English POS corpus in NLTK is the Brown corpus. See also http://www.nltk.org/book/ch05.html as well as http://en.wikipedia.org/wiki/Brown_Corpus and http://www.nltk.org/nltk_data/
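If the goal is to train on a bundled corpus plus your own sentences, one rough way to do it is to concatenate the two lists of tagged sentences and train a tagger on the result. The sketch below is only an illustration under some assumptions: `train_combined_tagger`, `base_sents`, and `my_sents` are made-up names, the tiny `base_sents` list stands in for a real corpus such as `nltk.corpus.treebank.tagged_sents()` (which needs `nltk.download("treebank")` first), and a simple `UnigramTagger` is swapped in for whatever model `nltk.pos_tag` uses internally.

```python
import nltk


def train_combined_tagger(base_sents, extra_sents):
    """Train a unigram tagger on base_sents plus extra_sents (hypothetical helper)."""
    combined = list(base_sents) + list(extra_sents)
    # Words never seen in training fall back to the tag "NN".
    default = nltk.DefaultTagger("NN")
    return nltk.UnigramTagger(combined, backoff=default)


# Assumption: a tiny stand-in for the bundled corpus. In practice this could be
# nltk.corpus.treebank.tagged_sents() once the treebank data is downloaded.
base_sents = [
    [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("A", "DT"), ("dog", "NN"), ("barks", "VBZ")],
]

# Your own sentences, tagged with the same tagset as the base corpus.
my_sents = [
    [("The", "DT"), ("wug", "NN"), ("blicks", "VBZ")],
]

tagger = train_combined_tagger(base_sents, my_sents)
tagged = tagger.tag(["The", "dog", "blicks"])
print(tagged)
```

The resulting `tagger.tag(tokens)` can then be used in place of `nltk.pos_tag(tokens)`. A unigram tagger is much weaker than the default tagger, so treat this as a starting point for how the data is combined, not as a recommendation for the model.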

tripleee

I am using the UPenn tagset (I believe; I am not 100% sure). I don't want to augment the existing corpora but to add a corpus, so that the classifier is more accurate once it is trained. – Greencat Mar 11 '15 at 20:48
Then it's probably the fragment from the Penn Treebank; #17 from the last link. You might then actually be better off replacing it entirely because it's rather old and gritty; google for English treebank corpora. – tripleee Mar 12 '15 at 04:12
http://stackoverflow.com/questions/8949517/is-there-any-treebank-for-free and https://catalog.ldc.upenn.edu/LDC2012T13 are among the top Google results. – tripleee Mar 12 '15 at 04:25
