25

Is there any place where I can download a treebank of English phrases for free or for less than $100? I need training data containing a bunch of syntactically parsed sentences (>1000) in English, in any format. Basically, all I need is for the words in these sentences to be tagged with their part of speech.

ahmadPH
YMC
  • 1
    Does NLTK not contain a sizeable subset of the Penn Treebank? – Hans Then Sep 07 '13 at 00:12
  • 7
    @on-hold: actually, this is a very useful question and the answers are also very useful, since these are comparatively scarce resources. Mind, this is not an "is A better than B" question, but a "list all resources of type X under condition Y" one. – rec Sep 07 '13 at 18:03
  • 3
    It's ridiculous that the LDC charges for data sets... Anyway, see https://en.wikipedia.org/wiki/Treebank#Syntactic_treebanks – Franck Dernoncourt Jul 20 '15 at 05:16

3 Answers

24

Here are a couple (English) treebanks available for free:

See also Wikipedia for a huge list.

dmcc
16

NLTK (for Python) offers several treebanks for free.
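
If all you need is the words with their part-of-speech tags, you may not need NLTK itself to read the files. Here is a minimal sketch, assuming the treebank text is in the usual Penn Treebank bracketed style, where each leaf looks like `(TAG word)`. The function name `tagged_words`, the `skip_traces` flag, and the sample sentence are made up for illustration; real files may need extra handling.

```python
import re

# A token is an opening bracket, a closing bracket,
# or a run of characters that are neither whitespace nor brackets.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")
BRACKETS = {"(", ")"}


def tagged_words(bracketed, skip_traces=True):
    """Return (word, tag) pairs from bracketed parse text.

    Assumption: leaves have the form "(TAG word)". Empty elements
    tagged -NONE- (traces) are dropped when skip_traces is True.
    """
    tokens = TOKEN_RE.findall(bracketed)
    pairs = []
    for i in range(len(tokens) - 3):
        tag, word = tokens[i + 1], tokens[i + 2]
        is_leaf = (
            tokens[i] == "("
            and tokens[i + 3] == ")"
            and tag not in BRACKETS
            and word not in BRACKETS
        )
        if not is_leaf:
            continue
        if skip_traces and tag == "-NONE-":
            continue
        pairs.append((word, tag))
    return pairs


# Hand-written sample in the same style (not taken from the corpus).
sample = "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))) (. .))"
for word, tag in tagged_words(sample):
    print(f"{word}\t{tag}")
```

This prints one word and its tag per line, tab-separated, which is easy to load from any language. If you do have NLTK installed, `nltk.download("treebank")` followed by `nltk.corpus.treebank.tagged_sents()` should give you similar pairs without parsing anything by hand.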

cyborg
  • Thanks, +1. I'm not familiar with Python, so please advise: how can I parse these *.pickle files? Is there a converter to something more user-friendly, like XML or just plain text? – YMC Jan 21 '12 at 00:35
  • 2
    What pickle file? The Treebanks are in text format. E.g., http://nltk.googlecode.com/svn/trunk/nltk_data/packages/corpora/treebank.zip . – cyborg Jan 21 '12 at 00:47
  • 5
    19 languages for free here: http://universaldependencies.github.io/docs/ – CpILL Mar 30 '15 at 11:32
  • Hindi and Urdu Dependency treebank: http://ltrc.iiit.ac.in/treebank_H2014/ – Saurav-- Sep 09 '18 at 13:57
-1

What about the Penn Treebank? I hope it is free or at least affordable. http://www.cis.upenn.edu/~treebank/cdrom2.html

Seid.M
  • 1
    It costs $3150 at LDC: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42 – YMC Jan 21 '12 at 00:36
  • 7
    It's included, along with lots of other treebanks, in OntoNotes 4.0 http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T03 which is free (though you have to pay a distribution cost). – Jeff Kaufman Sep 21 '12 at 12:47
  • how do you the distribution cost?? – CpILL Feb 19 '15 at 04:03
  • 1
    @CpILL You need to register on the website. It's a shame that some NLP researchers don't share data sets for free. LDC data sets can be really expensive. To make it worse, the taxpayers fund that nonsense. – Franck Dernoncourt Nov 12 '16 at 05:03
  • @JeffKaufman It's ridiculous it cannot be downloaded. 30 USD for shipping one DVD… – Franck Dernoncourt Nov 12 '16 at 05:07