Is there a resource of lots of human text?

Question

I just coded a Markov chain that talks based on learned data. I'm looking for an online source of a large amount of text, but can't seem to find any (most sites like Wikipedia have a lot of junk, not plain text files).

Is there any site that has a lot of text files suitable for testing a Markov chain on?

score 2 · Accepted Answer · answered Mar 14 '16 at 03:05

gutenberg.org might have some resources for you. For example, here's what appears to be a bunch of Moby Dick, in text file form.

http://www.gutenberg.org/files/2701/2701.txt
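As a rough sketch of how that file could be turned into Markov chain input: Project Gutenberg text files usually wrap the book in license text, so it probably makes sense to cut that off before training. The marker patterns and the helper names (`strip_boilerplate`, `tokenize`, `fetch_words`) below are my own assumptions, not anything Gutenberg provides, and the URL may have moved since this was posted.

```python
import re
import urllib.request

# URL taken from the answer above; it may have moved since it was posted.
MOBY_DICK_URL = "http://www.gutenberg.org/files/2701/2701.txt"

# Assumption: the book is wrapped in "*** START OF ..." and "*** END OF ..."
# marker lines. The exact wording varies between files.
START_RE = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_boilerplate(text):
    """Return the text between the START and END markers, if they are present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def tokenize(text):
    """Split on whitespace; punctuation stays attached to the words."""
    return text.split()


def fetch_words(url=MOBY_DICK_URL):
    """Download one plain-text book and return its words (needs network access)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read().decode("utf-8", errors="replace")
    return tokenize(strip_boilerplate(raw))
```

Calling `fetch_words()` should give a flat list of words that a Markov chain can be trained on. If the markers are missing, `strip_boilerplate` returns the whole text unchanged.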

mock_blatt

score 1 · Answer 2 · answered Mar 14 '16 at 03:02

If your concern is just removing the markup from Wikipedia, how about using a source like this one, which has already stripped the tags for you?

http://kopiwiki.dsd.sztaki.hu/

cytsunny

score 0 · Answer 3 · answered Mar 14 '16 at 02:51

Have you tried NLTK text corpora?

Warden

Aren't those usually just words, as opposed to full sentences? – Cisplatin Mar 14 '16 at 02:53
They include many full sentences, such as presidential speeches, books, etc. – Warden Mar 14 '16 at 18:20
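To illustrate the point about running text, here is a minimal sketch using the inaugural address corpus that ships with NLTK's data. It assumes `nltk` is installed and that the corpus can be downloaded. `preview` and `load_inaugural_words` are hypothetical helper names, not part of NLTK.

```python
from itertools import islice


def preview(tokens, limit=20):
    """Join the first `limit` tokens so you can check they form running text."""
    return " ".join(islice(tokens, limit))


def load_inaugural_words():
    """Return the tokens of the inaugural speeches in their original order.

    Assumes the third-party nltk package is installed and that the corpus
    download succeeds (needs network access the first time).
    """
    import nltk
    from nltk.corpus import inaugural

    nltk.download("inaugural", quiet=True)
    return inaugural.words()
```

Something like `print(preview(load_inaugural_words()))` should show the opening of the first speech. The tokens come out in order with punctuation as separate tokens, so sentence boundaries are still there for the chain to learn from.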

score 0 · Answer 4 · answered Mar 14 '16 at 03:06

Consider the Enron Email Dataset: https://www.cs.cmu.edu/~./enron/

It is also hosted on Amazon AWS: https://aws.amazon.com/datasets/enron-email-data/
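Each email comes with its headers, which you probably don't want in the training text. Below is a minimal sketch for pulling out just the bodies with Python's standard `email` module. It assumes the extracted archive is a directory tree with one raw message per file. The `root` path and both function names are placeholders of mine.

```python
import email
from pathlib import Path


def message_body(raw_message):
    """Parse one raw email and return its plain-text body without headers."""
    msg = email.message_from_string(raw_message)
    if msg.is_multipart():
        # Keep only the text/plain parts of a multipart message.
        parts = [
            part.get_payload()
            for part in msg.walk()
            if part.get_content_type() == "text/plain"
        ]
        return "\n".join(parts).strip()
    return msg.get_payload().strip()


def iter_bodies(root):
    """Yield the body of every file under `root` (hypothetical extracted folder)."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            # latin-1 never fails to decode, which suits messy real-world mail.
            yield message_body(path.read_text(encoding="latin-1"))
```

Bodies will still contain quoted replies and signatures, so the output is conversational but noisy. Whether that helps or hurts depends on what you want the chain to sound like.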

Ewan Mellor
