5

Can anyone point me to a large corpus that I can use for classification?

By large I don't mean Reuters or 20 Newsgroups; I'm talking about a corpus that is gigabytes in size, not 20 MB or so.

I was only able to find Reuters and 20 Newsgroups, which are far too small for what I need.

Kobe-Wan Kenobi

2 Answers

6

The most popular datasets for text-classification evaluation are:

However, the datasets above do not meet the 'large' requirement. The datasets below might meet your criteria:

  • Common Crawl: You could build a large corpus by extracting articles that have specific keywords in their meta tags and use it for document classification.

  • Enron Email Dataset: You could do a variety of different classification tasks here.

  • Topic Annotated Enron Dataset: Not free, but already labelled, and it meets your large-corpus requirement.
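The meta-tag idea for crawled pages could look roughly like the sketch below. It assumes you already have the raw HTML of each page as a string (fetching and decoding the crawl archives is not shown), and the `LABEL_KEYWORDS` mapping and the `label_page` helper are made up purely for illustration; you would substitute your own classes and trigger keywords.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated values of <meta name="keywords" content="...">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            self.keywords += [k.strip().lower() for k in content.split(",") if k.strip()]


# Hypothetical mapping from class label to trigger keywords (an assumption).
LABEL_KEYWORDS = {
    "sports": {"football", "tennis", "olympics"},
    "music": {"concert", "album", "guitar"},
}


def label_page(html):
    """Return the first label whose keywords overlap the page's meta keywords, else None."""
    parser = MetaKeywordParser()
    parser.feed(html)
    found = set(parser.keywords)
    for label, triggers in LABEL_KEYWORDS.items():
        if found & triggers:
            return label
    return None
```

Pages for which `label_page` returns `None` would simply be skipped. Keyword matching like this gives noisy labels, so it is worth spot-checking a sample before training on the result.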

You can browse other publicly available datasets here

Other than the above, you might have to develop your own corpus. I will be releasing a news corpus builder later this weekend that will help you develop custom corpora based on topics of your choice.

Update:

I created the custom corpus builder module I mentioned above but forgot to link it: News Corpus Builder

Skillachie
1

Huge Reddit archive spanning 10/2007 to 5/2015
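One way to get labels out of it is to treat the subreddit as the class. Here is a minimal sketch, assuming the dump is newline-delimited JSON with one comment per line and that each record carries `subreddit` and `body` fields; check both assumptions against the files you actually download. The function name is made up for illustration.

```python
import json
from collections import defaultdict


def load_labeled_comments(lines, wanted_subreddits):
    """Group comment bodies by subreddit, treating the subreddit as the class label."""
    corpus = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if not isinstance(record, dict):
            continue
        # Field names are assumptions about the dump's schema.
        label = record.get("subreddit")
        body = record.get("body")
        if label in wanted_subreddits and body and body != "[deleted]":
            corpus[label].append(body)
    return dict(corpus)
```

Because it only iterates over lines, you can pass an open file handle and stream a multi-GB file without loading it all into memory (the collected bodies still accumulate, so restrict `wanted_subreddits` or write out in batches for very large selections).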

Bob Dillon
  • Thanks, but this doesn't seem like a labeled, classification-ready dataset? – Kobe-Wan Kenobi Aug 27 '15 at 11:15
  • What exactly do you mean by labeled? – maj Aug 27 '15 at 12:14
  • @maj I mean a corpus of documents where for each document you know to which class it belongs, for example - sports, history, music, etc. – Kobe-Wan Kenobi Aug 27 '15 at 13:40
  • The archive is in JSON format, so the text is easily parsed out and, being Reddit, is well organized. The difference between r/Drugs and drugs is semantic IMHO. It's not completely formatted for ML, but it's as close as any dataset I've seen, particularly one of this size and scope. Let us know if you find what you're looking for as we all may have use for it too. – Bob Dillon Aug 27 '15 at 13:51