0

I just coded a Markov chain that generates text based on the data it has learned. I'm looking for a large source of text data online, but can't seem to find one (most sites, like Wikipedia, contain a lot of junk rather than plain text files).

Is there any site with a lot of plain text files suitable for testing a Markov chain?

Cisplatin

4 Answers

2

gutenberg.org might have some resources for you. For example, here's what appears to be the text of Moby Dick as a plain text file:

http://www.gutenberg.org/files/2701/2701.txt
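
As a rough sketch of how a file like this could feed a word-level Markov chain: the helper names below (`fetch_text`, `strip_gutenberg_boilerplate`, `build_chain`, `generate`) are made up for illustration, and the `*** START OF` / `*** END OF` marker lines are an assumption about how the license header and footer are delimited, so check the actual file before relying on them.

```python
import random
from collections import defaultdict
from urllib.request import urlopen


def fetch_text(url):
    """Download a plain text file and decode it (assumes UTF-8, replaces bad bytes)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def strip_gutenberg_boilerplate(text):
    """Keep only the lines between the '*** START OF' and '*** END OF' marker lines.

    Assumption: the file delimits its license header/footer with such lines.
    If a marker is missing, that end of the text is kept whole.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1
        elif line.startswith("*** END OF"):
            end = i
            break
    return "\n".join(lines[start:end])


def build_chain(text, order=1):
    """Map each tuple of `order` consecutive words to the list of words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, length=20, seed=None):
    """Random-walk the chain, stopping early if a state has no successors."""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)
```

With network access, `generate(build_chain(strip_gutenberg_boilerplate(fetch_text("http://www.gutenberg.org/files/2701/2701.txt"))), length=50)` should return up to 50 words sampled from the book. Raising `order` to 2 or 3 tends to give more coherent output at the cost of copying longer stretches of the source.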

mock_blatt
1

If your only concern is removing the tags from Wikipedia, how about using a source like this one, which has already removed the tags for you?

http://kopiwiki.dsd.sztaki.hu/

cytsunny
0

Have you tried the NLTK text corpora?
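
For instance, NLTK includes a small selection of Project Gutenberg texts as an already tokenized corpus. A minimal sketch, assuming `nltk` is installed and the `gutenberg` corpus can be downloaded; `load_corpus_words` and `successor_counts` are hypothetical helper names, not part of NLTK:

```python
from collections import Counter, defaultdict


def load_corpus_words(fileid="melville-moby_dick.txt"):
    """Return the tokens of one file from NLTK's Gutenberg corpus.

    nltk is imported lazily so the rest of the sketch works without it.
    """
    import nltk
    from nltk.corpus import gutenberg

    nltk.download("gutenberg", quiet=True)  # fetches the corpus if not already present
    return list(gutenberg.words(fileid))


def successor_counts(words):
    """Count, for every word, how often each other word directly follows it."""
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts
```

`successor_counts(load_corpus_words())["whale"].most_common(5)` would then list the five most frequent successors of "whale", which is the kind of table a first-order Markov chain samples from.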

Warden
0

Consider the Enron Email Dataset: https://www.cs.cmu.edu/~./enron/

It is also hosted on Amazon AWS: https://aws.amazon.com/datasets/enron-email-data/
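
Email headers would add noise to a Markov chain, so it probably makes sense to train on message bodies only. A small sketch using Python's standard `email` module, assuming each message is available as raw text with headers, a blank line, and then the body; `message_body` is a hypothetical helper name, and transfer encodings such as quoted-printable are not decoded here:

```python
from email import message_from_string


def message_body(raw_message):
    """Return the plain text body of a raw email message, without headers."""
    message = message_from_string(raw_message)
    if message.is_multipart():
        # Collect every text/plain part and join them in order.
        parts = [
            part.get_payload()
            for part in message.walk()
            if part.get_content_type() == "text/plain"
        ]
        return "\n".join(parts).strip()
    return message.get_payload().strip()
```

Concatenating the bodies of many messages gives one large plain-text training string.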

Ewan Mellor