
I have a very large dataset of documents (500 million) and want to cluster all of them according to their content.

What would be the best way to approach this? I tried using k-means, but it does not seem suitable because it needs all documents at once in order to do its calculations.

Are there any clustering algorithms suitable for larger datasets?

For reference: I am using Elasticsearch to store my data.

fwind
  • Do you know the labels/categories already? Something like "spam/nonspam"? Or "entertainment/health/politics/sports..."? Or do you have to find the number of topics and the topics themselves from the documents first? – knb May 12 '15 at 11:28
  • No, I don't have labels for them. My first approach was to generate a TF-IDF matrix for every article and cluster the articles by this matrix. – fwind May 12 '15 at 11:38
  • 500 million documents is large, but not unmanageable. How did your Tf-Idf approach work? Seems like that should give a reasonable first approximation without using too terribly much memory. – Jim Mischel May 12 '15 at 15:34
  • **Why?** The result will be useless, even if you scale it up to 500 mio documents. – Has QUIT--Anony-Mousse May 12 '15 at 18:29
  • @Anony-Mousse: Can you clarify what you mean? – fwind May 13 '15 at 16:16

2 Answers


According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class on Coursera, the most common methods for clustering text data are:

  • Combination of k-means and agglomerative (bottom-up) clustering
  • Topic modeling
  • Co-clustering
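All of these methods start from a numeric representation of the documents, most commonly TF-IDF vectors (the representation mentioned in the comments above). As a toy illustration of that first step, here is a minimal pure-Python sketch; the three-document corpus is made up, and at your scale the vocabulary and vectors would of course have to stay sparse and be built in a streaming fashion:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF weight dicts."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]
vecs = tfidf_vectors(corpus)
# rare terms ("cat", "dog") get higher weight than common ones ("the")
```

Any of the clustering methods listed above would then operate on these vectors rather than on the raw text.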

But I can't tell how to apply these to your dataset. It's big - good luck.

For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package for text mining in R, which works with document-term matrices.

The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then a Support Vector Machine classifier to some documents (mailing lists and legal texts). The case studies are written in tutorial style, but the datasets are not available.

The process involves many intermediate steps of manual inspection.

knb

There are k-means variants that process documents one by one,

MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1.

and k-means variants that repeatedly draw random samples:

Sculley, D. (2010). Web-Scale K-Means Clustering. Proceedings of the 19th International Conference on World Wide Web.

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
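The one-by-one (MacQueen-style) update is simple enough to sketch in plain Python. This toy uses 1-D points and made-up data; real documents would be sparse TF-IDF vectors, but the memory story is the same — only one point is held at a time:

```python
def sequential_kmeans(points, centroids):
    """One-pass k-means (MacQueen-style): each point is assigned to its
    nearest centroid, which is then moved to the running mean of the
    points assigned to it so far."""
    centroids = list(centroids)
    counts = [0] * len(centroids)
    for p in points:
        # index of the nearest centroid (squared distance, 1-D toy data)
        j = min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
        counts[j] += 1
        # incremental mean update: c += (p - c) / n
        centroids[j] += (p - centroids[j]) / counts[j]
    return centroids

# toy stream of points forming two groups, around 0 and around 10
stream = [0.1, 9.8, 0.3, 10.2, -0.2, 9.9, 0.0, 10.1]
final = sequential_kmeans(stream, [0.0, 10.0])
# each centroid ends near the mean of the points assigned to it
```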
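The sampling flavour can be sketched the same way. This toy reproduces only Sculley's per-centroid learning rate, not his full algorithm (no sparsity projection, no convergence check), and the data and starting centroids are made up:

```python
import random

def minibatch_kmeans(data, centroids, batch_size=4, iters=50, seed=0):
    """Mini-batch k-means sketch: each iteration draws a small random
    sample and nudges centroids with a per-centroid learning rate 1/n,
    so the full dataset never has to be in memory at once."""
    rng = random.Random(seed)
    centroids = list(centroids)
    counts = [0] * len(centroids)
    for _ in range(iters):
        batch = rng.sample(data, batch_size)
        # cache the nearest-centroid assignment for each sampled point
        assign = [min(range(len(centroids)),
                      key=lambda i, p=p: (p - centroids[i]) ** 2)
                  for p in batch]
        for p, j in zip(batch, assign):
            counts[j] += 1
            eta = 1.0 / counts[j]          # per-centroid learning rate
            centroids[j] = (1 - eta) * centroids[j] + eta * p
    return centroids

data = [0.1, 0.2, -0.1, 0.0, 9.9, 10.1, 10.0, 9.8]
final = minibatch_kmeans(data, [1.0, 8.0])
# centroids drift toward the two group means (near 0 and near 10)
```

In practice, `rng.sample` would be replaced by whatever sampling your storage layer supports, e.g. a random scroll over an Elasticsearch index.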

But in the end, it's still the same old k-means: a good quantization approach, but not very robust to noise, and not capable of handling clusters of different sizes, non-convex shapes, or hierarchy (e.g. baseball inside sports). It's a signal-processing technique, not a data-organization technique.

So the practical impact of all of these is zero. Yes, they can run k-means on insanely large data - but if you can't make sense of the result, why would you do so?

Has QUIT--Anony-Mousse