Mahout k-means on Hadoop

Question

I want to run k-means clustering on Hadoop in pseudo-distributed mode. I have 5 million vectors in a .mat file, with 38 numeric features per vector, like this: 0 0 1 0 0 0 0 0 0 0 0 0 ...

I've run the examples that I've found, like Reuters (https://mahout.apache.org/users/clustering/k-means-clustering.html) or synthetic data. I know I have to convert these vectors to a SequenceFile, but I don't know whether I have to do anything else first.

I'm using Mahout 0.7 and Hadoop 1.2.1.

score 0 · Answer 1 · edited May 23 '17 at 12:05

Yes, you need a small preprocessing step.

Since the .mat file is a binary file, the first step is to convert it into a text file (.txt) in which each line is one vector with its 38 feature values.
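As an illustration of that text layout, here is a minimal sketch in Python (standard library only). The function names `write_vectors_as_text` and `read_vectors_from_text` are made up for this example, and it assumes you already have the rows in memory or as an iterable; how you get them out of the .mat file (for example by exporting from Matlab) is outside this sketch.

```python
N_FEATURES = 38  # from the question: 38 numeric features per vector


def write_vectors_as_text(vectors, path, n_features=N_FEATURES):
    """Write one vector per line, values separated by single spaces.

    Returns the number of vectors written. Hypothetical helper, not part
    of Mahout or Hadoop.
    """
    count = 0
    with open(path, "w") as out:
        for row in vectors:
            values = list(row)
            if len(values) != n_features:
                raise ValueError(
                    "vector %d has %d values, expected %d"
                    % (count, len(values), n_features)
                )
            # repr(float(v)) round-trips the value without losing precision
            out.write(" ".join(repr(float(v)) for v in values))
            out.write("\n")
            count += 1
    return count


def read_vectors_from_text(path, n_features=N_FEATURES):
    """Yield each non-blank line back as a list of floats, checking its width."""
    with open(path) as src:
        for line_no, line in enumerate(src, start=1):
            fields = line.split()
            if not fields:
                continue  # tolerate blank lines
            if len(fields) != n_features:
                raise ValueError(
                    "line %d has %d values, expected %d"
                    % (line_no, len(fields), n_features)
                )
            yield [float(f) for f in fields]
```

Because both functions stream row by row, 5 million vectors never have to sit in memory at once. Reading the file back with `read_vectors_from_text` is a cheap sanity check before you hand the data to the SequenceFile step.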

The next step is to use seqdirectory (or to write your own SequenceFile writer), and all the other steps follow as in the Reuters example. Keep in mind that Mahout's k-means reads its input as a SequenceFile of VectorWritable values, so with purely numeric data a custom writer that emits the vectors directly is probably the more natural route.

For an example of writing your own SequenceFile writer, see "How to convert .txt file to Hadoop's sequence file format".

I did the same for Mahout LDA: I wrote my own SequenceFile writer and passed its output to the next step of the LDA process, namely seq2sparse.

score 0 · Answer 2 · answered May 25 '14 at 15:58

Never use pseudo-distributed mode

Mahout only pays off if your data is far too large to be analyzed on a single computer, that is, if you really need at least a dozen computers to hold and process it.

The reason is the architecture. Mahout is built on top of map-reduce and relies on writing plenty of interim data to disk so that it can recover from crashes.

In pseudo-distributed mode, it cannot recover from such crashes well anyway.

Pseudo-distributed mode is okay if you want to learn how to install and configure Mahout without having access to a real cluster. It is not a reasonable way to analyze real data.

Instead, use the functionality built into Matlab, or use a clustering tool designed for single nodes such as ELKI. It will usually outperform Mahout by an order of magnitude by not writing everything to disk a number of times. In my experiments, these tools outperformed a 10-core Mahout cluster by a factor of 10 on a single core, because I/O cost completely dominates the runtime.

Benchmark yourself

If you don't trust me on this, benchmark it yourself. Load the Reuters data into Matlab and cluster it there. I'm pretty sure Matlab will make Mahout look like an old fad.
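To make "benchmark it yourself" concrete, here is a minimal single-node sketch in Python (standard library only). It is not Matlab, ELKI, or Mahout code; it is a plain in-memory Lloyd's k-means with a timing wrapper, and the names `kmeans` and `timed_kmeans` are made up for this example. Pure Python will be far slower than an optimized tool, so treat it as a way to see what the in-memory algorithm does, not as a fair performance baseline.

```python
import random
import time


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=20, seed=0):
    """Lloyd's algorithm, entirely in memory.

    Returns (centroids, assignments). Initial centroids are k distinct
    input points chosen at random (a simple choice, not k-means++).
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [-1] * len(points)
    dim = len(points[0])
    for _ in range(max_iter):
        changed = False
        # Assignment step: attach every point to its nearest centroid.
        for i, p in enumerate(points):
            best = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            if best != assignments[i]:
                assignments[i] = best
                changed = True
        if not changed:
            break  # converged: no point switched cluster
        # Update step: move every centroid to the mean of its points.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p, c in zip(points, assignments):
            counts[c] += 1
            for j, x in enumerate(p):
                sums[c][j] += x
        for c in range(k):
            if counts[c]:  # keep the old centroid if a cluster went empty
                centroids[c] = [s / counts[c] for s in sums[c]]
    return centroids, assignments


def timed_kmeans(points, k, **kwargs):
    """Run kmeans and also return the wall-clock time in seconds."""
    start = time.perf_counter()
    result = kmeans(points, k, **kwargs)
    return result, time.perf_counter() - start
```

Note that nothing is written to disk between iterations here; the whole loop works on data held in memory, which is the point of the comparison with the map-reduce approach. To compare, time the same data and the same `k` on both sides and include the time spent loading and converting the input.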
