7

I am developing a parser in Ruby that parses nonuniform text data. Can anybody tell me where I can get a good amount of plain-text data for that?

Phrogz

2 Answers

6

Here you'll find a list of many:

http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public

And my favorite is:

http://ftp.sunet.se/mirror/archive/ftp.sunet.se/pub/tv+movies/imdb/

Vadim Kotov
intellidiot
5

You could scrape Wikipedia (or just run a bunch of its pages through lynx -dump). That would also give you a vast source of non-English text. Project Gutenberg would be another good source of large amounts of plain text.
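
A minimal sketch of the lynx approach driven from Ruby, assuming lynx is installed and on the PATH. The helper names lynx_dump_command and dump_page, and the example URL, are illustrative and not part of the original answer.

```ruby
require "open3"

# Hypothetical helper: argument list for a plain-text dump of one page.
# -dump prints the rendered page to stdout; -nolist omits the trailing link list.
def lynx_dump_command(url)
  ["lynx", "-dump", "-nolist", url]
end

# Hypothetical helper: run lynx and return the page as plain text.
# Passing the command as separate arguments avoids shell quoting issues.
def dump_page(url)
  text, status = Open3.capture2(*lynx_dump_command(url))
  raise "lynx failed for #{url}" unless status.success?
  text
end

# Example (requires lynx and network access):
#   File.write("ruby.txt", dump_page("https://en.wikipedia.org/wiki/Ruby_(programming_language)"))
```

If you fetch many pages this way, pause between requests so you do not hammer the site.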

mu is too short
  • @Phrogz: I used to be a Gutenberg addict back in my "Palm Pilot and commuting on the bus" days. – mu is too short Apr 26 '11 at 04:14
  • Project Gutenberg has a very strict bot policy: they allow no more than 100 visits from the same IP address in a day. – kyle k Jul 02 '13 at 06:29
  • 2
    @kyle k That's ok. They have a torrent: http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project – Phil Oct 03 '13 at 17:14