I have a file containing vectors of data, where each row contains a comma-separated list of values. I am wondering how to perform k-means clustering on this data using Mahout. The example provided in the wiki mentions creating SequenceFiles, but I am not sure whether I need to convert my data in some way to obtain them.

Dan Q
  • Do you need to use Mahout for this, or will anything do? There are a lot of clustering APIs, tools, sample code, etc. that would do this easily. If you have a single file, your data set might be quite small; Mahout is, in theory, meant for large-scale problems. – Steve Jan 11 '12 at 12:49
  • I'm looking at clustering data sets from here: http://www.grouplens.org/node/73 The largest data set potentially contains 10,000 by 72,000 data points. That is why I thought Mahout might be best; WEKA crashes when I try to load the smaller data sets. – Dan Q Jan 13 '12 at 16:55
  • Try http://glaros.dtc.umn.edu/gkhome/software ; Weka also has an SDK. k-means is quite straightforward to implement in most languages, so I'm sure you can find some code snippets on Google. – Steve Jan 13 '12 at 21:25

2 Answers

I would recommend manually reading in the entries from the CSV file, creating NamedVectors from them, and then using a sequence file writer to write the vectors to a sequence file. From there, the KMeansDriver run method should know how to handle these files.

Sequence files encode key-value pairs, so the key would be an ID of the sample (it should be a string), and the value is a VectorWritable wrapper around the vectors.

Here is a simple code sample on how to do this:

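    import java.util.LinkedList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    // Build the list of named vectors (one per data row)
    List<NamedVector> vectors = new LinkedList<NamedVector>();
    NamedVector v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
    vectors.add(v1);

    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(config);
    Path path = new Path("datasamples/data");

    // Write a SequenceFile from the vectors: key = vector name, value = VectorWritable
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector v : vectors) {
        vec.set(v);
        // The value must be the VectorWritable wrapper, not the NamedVector itself
        writer.append(new Text(v.getName()), vec);
    }
    writer.close();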

Also, I would recommend reading chapter 8 of Mahout in Action. It gives more details on data representation in Mahout.
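
The first step described above, reading the entries from the CSV file, could look roughly like the sketch below. It uses only the Java standard library and makes no Mahout calls. The class name `CsvRows`, the `Row` holder, the `parse` method, and the generated `row-N` IDs are hypothetical names chosen for illustration, and it assumes every field is a plain number with no header row or quoted values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvRows {

    /** One parsed row: a generated ID plus its numeric values. */
    public static final class Row {
        public final String id;
        public final double[] values;

        Row(String id, double[] values) {
            this.id = id;
            this.values = values;
        }
    }

    /** Parses comma-separated numeric lines, skipping blank ones. */
    public static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        int lineNo = 0;
        for (String rawLine : lines) {
            lineNo++;
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // ignore empty lines
            }
            String[] fields = line.split(",");
            double[] values = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                // Throws NumberFormatException on a non-numeric field
                values[i] = Double.parseDouble(fields[i].trim());
            }
            // Assumption: the 1-based line number is good enough as a sample ID
            rows.add(new Row("row-" + lineNo, values));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("0.1,0.2,0.5", "1.0, 2.0, 3.0");
        for (Row row : parse(sample)) {
            System.out.println(row.id + " has " + row.values.length + " values");
        }
    }
}
```

Each `Row` then maps onto one vector: `values` is the `double[]` passed to the `DenseVector` constructor and `id` is the name given to the `NamedVector`. For a real file, the lines could come from `java.nio.file.Files.readAllLines`; for very large inputs, reading and writing line by line would avoid holding everything in memory.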

Bojana Popovska
  • Do you know how I can get the vector names back from the clustering results? See http://stackoverflow.com/questions/14476706/dumping-clustering-result-with-vectors-names – exic Jan 24 '13 at 09:33
  • There's a small error in your example (thanks for posting it, BTW). Instead of "writer.append(new Text(v.getName()), v);" I think it needs to be "writer.append(new Text(v.getName()), vec);". Otherwise you get an exception saying "java.io.IOException: wrong value class: org.apache.mahout.math.NamedVector is not class org.apache.mahout.math.VectorWritable" – user311121 Apr 16 '13 at 16:54

Maybe you could use Elephant Bird to write the vectors in Mahout format:

https://github.com/kevinweil/elephant-bird#hadoop-sequencefiles-and-pig