Is there anywhere I can download a treebank of English phrases for free or for less than $100? I need training data containing a set of syntactically parsed English sentences (more than 1,000) in any format. Essentially, all I need is the words in these sentences labeled with their parts of speech.
- Does NLTK not contain a sizeable subset of the Penn Treebank? – Hans Then Sep 07 '13 at 00:12
- @on-hold: actually, this is a very useful question and the answers are also very useful, since these are comparatively scarce resources. Mind, this is not an "is A better than B" question, but a "list all resources of type X under condition Y". – rec Sep 07 '13 at 18:03
- It's ridiculous that the LDC charges for data sets... Anyway, see https://en.wikipedia.org/wiki/Treebank#Syntactic_treebanks – Franck Dernoncourt Jul 20 '15 at 05:16
3 Answers
Here are a few English treebanks available for free:
American National Corpus: MASC
Questions: QuestionBank and Stanford's corrections
British news: BNC
TED talks: NAIST-NTT TED Treebank
Georgetown University Multilayer Corpus: GUM
Biomedical:
See also Wikipedia for a huge list.

dmcc
16
NLTK (for Python) offers several treebanks for free.
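If all you need is each word with its part of speech, the parsed files can be flattened with a few lines of code. The following is only a minimal sketch: it assumes the parses are stored in the usual Penn-style bracketed text format, and both the `tagged_words` helper and the sample sentence are made up for illustration (they are not taken from NLTK or from any corpus).

```python
import re

# A leaf in a Penn-style bracketed parse looks like "(TAG word)":
# an opening paren, a tag, whitespace, a token with no parens, a closing paren.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def tagged_words(bracketed):
    """Return (word, tag) pairs from a bracketed parse string.

    Hypothetical helper: it skips leaves tagged -NONE-, which mark
    empty elements (traces) rather than real words.
    """
    return [(word, tag)
            for tag, word in LEAF.findall(bracketed)
            if tag != "-NONE-"]

# Illustrative sentence, not copied from a real treebank.
sample = ("(S (NP (DT The) (NN cat)) "
          "(VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))) (. .))")

pairs = tagged_words(sample)
for word, tag in pairs:
    print(f"{word}\t{tag}")
```

The tab-separated output can be redirected to a plain-text file, which avoids any dependency on Python-specific formats.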

kkm inactive - support strike

cyborg
- Thanks, +1. I'm not familiar with Python, so please advise: how can I parse these *.pickle files? Is there a converter to something more user-friendly, like XML or plain text? – YMC Jan 21 '12 at 00:35
- What pickle file? The Treebanks are in text format. E.g., http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/treebank.zip . – cyborg Jan 21 '12 at 00:47
- 19 languages for free here: http://universaldependencies.github.io/docs/ – CpILL Mar 30 '15 at 11:32
- Hindi and Urdu Dependency treebank: http://ltrc.iiit.ac.in/treebank_H2014/ – Saurav-- Sep 09 '18 at 13:57
What about the Penn Treebank? I hope it is free or at least affordable. http://www.cis.upenn.edu/~treebank/cdrom2.html

Seid.M
- It costs $3150 at LDC: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42 – YMC Jan 21 '12 at 00:36
- It's included, along with lots of other treebanks, in OntoNotes 4.0 http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T03 which is free (though you have to pay a distribution cost). – Jeff Kaufman Sep 21 '12 at 12:47
- @CpILL You need to register on the website. It's a shame that some NLP researchers don't share data sets for free. LDC data sets can be really expensive. To make it worse, the taxpayers fund that nonsense. – Franck Dernoncourt Nov 12 '16 at 05:03
- @JeffKaufman It's ridiculous it cannot be downloaded. 30 USD for shipping one DVD… – Franck Dernoncourt Nov 12 '16 at 05:07