I have a dataset (300 MB) on which I want to run k-means clustering using Mahout. The data is a CSV file that contains only numerical values. Is it still necessary to supply the file in vectorized format to the Mahout k-means command? If not, how can I run the k-means command directly on my CSV file without converting it to a vector format?
1 Answer
If your data is only 300 MB, the answer is: don't use Mahout at all.
Really ONLY EVER use Mahout when your data no longer fits into memory. MapReduce is expensive; you only want to use it when you can't solve the problem without it.
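
As a rough illustration of the in-memory route, here is a minimal sketch of plain k-means (Lloyd's algorithm) on a numeric CSV, using only the Python standard library. It assumes the file has no header row and fits in memory; the function names `load_rows` and `kmeans`, and the parameters `max_iter` and `seed`, are made up for this example and have nothing to do with Mahout.

```python
import csv
import random


def load_rows(path):
    """Read a headerless, purely numeric CSV into a list of float lists."""
    with open(path, newline="") as handle:
        return [[float(cell) for cell in row] for row in csv.reader(handle) if row]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct rows picked at random (a simple, naive choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

With a hypothetical file name and cluster count, usage would look like `centroids, labels = kmeans(load_rows("data.csv"), k=5)`. Pure Python will be slow on 300 MB; a vectorized implementation such as scikit-learn's `KMeans` is likely the more practical choice, but the point stands that no vector conversion or MapReduce job is needed at this size.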

Has QUIT--Anony-Mousse
-
This was just sample data; my actual data is quite large. But the problem is that I have only numerical values in the data. Is Mahout k-means applicable only to string values, or can I run it successfully on purely numerical values as well? Has anyone tried this? Kindly reply. – user3036420 Nov 27 '13 at 12:33
-
K-means computes *means*. It is actually *only* applicable to numerical data. Have you ever read a description of k-means? – Has QUIT--Anony-Mousse Nov 27 '13 at 14:32
-
@Anony-Mousse: https://imiloainf.wordpress.com/2013/07/27/mahout-kmeans-example/ says that Mahout k-means is mainly for text processing; if you need to process numerical data, you have to write some utility functions that write the numerical data into sequence-vector format. – USB Jan 02 '15 at 10:07
-
@SreeVeni So what? Yes, you can run k-means on TF-IDF vectors. But Mahout will still suck, and it will still be awfully slow. – Has QUIT--Anony-Mousse Jan 02 '15 at 11:05
-
@Anony-Mousse: I tried running k-means with synthetic data: http://unmeshasreeveni.blogspot.in/2014/11/how-to-run-k-means-clustering-in-mahout.html How can we supply our own data to run k-means? Is there any tutorial on this? – USB Jan 03 '15 at 03:49
-
Don't ask me. I already told you to forget about using Mahout for clustering. It's really, really, really slow and hard to use. – Has QUIT--Anony-Mousse Jan 03 '15 at 10:15