How to maintain data entry id in Mahout K-means clustering

Question

I'm using Mahout to run k-means clustering, and I have a problem identifying the data entries after clustering. For example, I have 100 data entries:

id      data
0       0.1 0.2 0.3 0.4
1       0.2 0.3 0.4 0.5
...     ...
100     0.2 0.4 0.4 0.5

After clustering, I need to get the id back from the cluster result to see which point belongs to which cluster, but there seems to be no way to keep the id.

In the official Mahout example of clustering synthetic control data, only the data is passed to Mahout, without ids, like this:

28.7812 34.4632 31.3381 31.2834 28.9207 ...
...
24.8923 25.741  27.5532 32.8217 27.8789 ...

and the cluster result only has the cluster id and the point values:

VL-539{n=38 c=[29.950, 30.459, ...
   Weight:  Point:
   1.0: [28.974, 29.026, 31.404, 27.894, 35.985...
   2.0: [24.214, 33.150, 31.521, 31.986, 29.064

No point id appears anywhere. Does anyone have an idea how to keep a point id when doing Mahout clustering? Thank you very much!

score 2 · Answer 1 · answered Feb 12 '13 at 17:43

To achieve that I use NamedVectors.

As you know, before doing any clustering with your data, you have to vectorize it.

This means that you have to transform your data into Mahout vectors, because that is the kind of data the clustering algorithms work with.

The vectorization process depends on the nature of your data; for example, vectorizing text is not the same as vectorizing numerical values.

Your data seems easy to vectorize, since each row only has an ID and 4 numerical values.

You could write a Hadoop job that takes your input data, for example as a CSV file, and outputs a SequenceFile with your data already vectorized.

Then you apply the Mahout clustering algorithms to this input, and the ID (vector name) of each vector is kept in the clustering results.

An example job to vectorize your data could be implemented with the following classes:

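import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: %s [generic options] <input> <output>\n", getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Job job = new Job(getConf(), "Create Dense Vectors from CSV input");
        job.setJarByClass(DenseVectorizationDriver.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(DenseVectorizationMapper.class);
        job.setReducerClass(DenseVectorizationReducer.class);

        // The same key/value types are used for the map output and the final output.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);

        // Mahout's clustering jobs read their input from SequenceFiles.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new DenseVectorizationDriver(), args));
    }
}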


import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationMapper extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {
/*
 * This mapper class takes the input from a CSV file whose fields are separated by TAB and emits
 * the same key it receives (the byte offset of the line, useless in this case) and a NamedVector as value.
 * The "name" of the NamedVector is the ID of each row.
 */
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        String[] lineParts = line.split("\t", -1);
        String id = lineParts[0];

        // you should do some checks here to make sure that this piece of data is correct

        // Field 0 is the ID, so the vector holds the remaining lineParts.length - 1 values.
        Vector vector = new DenseVector(lineParts.length - 1);
        for (int i = 1; i < lineParts.length; i++) {
            // Field i goes to vector index i - 1 (vector indices start at 0).
            vector.set(i - 1, Double.parseDouble(lineParts[i]));
        }

        // Wrapping the vector in a NamedVector attaches the row ID to it.
        vector = new NamedVector(vector, id);

        context.write(key, new VectorWritable(vector));
    }
}


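import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.VectorWritable;

public class DenseVectorizationReducer extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {
/*
 * This reducer simply writes the output without doing any computation.
 * A map-only job (job.setNumReduceTasks(0) in the driver) would probably be a better fit.
 */
    @Override
    public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context) throws IOException, InterruptedException {

        // Write every value: two input files can produce the same line offset as key,
        // and taking only the first value would silently drop the other rows.
        for (VectorWritable value : values) {
            context.write(key, value);
        }
    }
}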
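The parsing step of the mapper can be tried on its own, without Hadoop or Mahout on the classpath. The following is only a sketch in plain Java: `IdRowParser` and `Row` are hypothetical names made up for this illustration, and a `double[]` stands in for the Mahout vector. It shows the kind of input checks the mapper comment asks for, and how field i of a row maps to index i - 1 of the vector.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Mahout): splits one TAB-separated row into ID and values. */
class IdRowParser {

    /** Keeps the row ID next to its numeric values, like a NamedVector keeps its name. */
    static final class Row {
        final String id;
        final double[] values;

        Row(String id, double[] values) {
            this.id = id;
            this.values = values;
        }
    }

    /** Parses "<id>TAB<v1>TAB<v2>..." and rejects rows without an ID or without values. */
    static Row parse(String line) {
        String[] parts = line.split("\t", -1);
        if (parts.length < 2 || parts[0].trim().isEmpty()) {
            throw new IllegalArgumentException("Expected <id>TAB<value>...: " + line);
        }
        double[] values = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            // Field i of the row becomes component i - 1 of the vector.
            // parseDouble throws NumberFormatException on empty or malformed fields.
            values[i - 1] = Double.parseDouble(parts[i].trim());
        }
        return new Row(parts[0].trim(), values);
    }

    public static void main(String[] args) {
        Row row = parse("1\t0.2\t0.3\t0.4\t0.5");
        System.out.println(row.id + " -> " + Arrays.toString(row.values));
    }
}
```

In the real job, the `double[]` would be wrapped as `new NamedVector(new DenseVector(values), id)`, so the ID travels with the vector through clustering.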

I didn't go through all of your code, but your first line was enough. "NamedVector"! — Asad Iqbal, Apr 30 '14 at 08:47

score 0 · Answer 2 · answered Feb 26 '12 at 14:26

Your request is often overlooked by programmers who are not themselves practitioners, unfortunately. I do not know how to do it in Mahout (so far), but I started with Apache Commons Math, which includes a k-means implementation with the same defect. I adapted it so that your request is satisfied. You will find it here: http://code.google.com/p/noolabsimplecluster/ Additionally, don't forget to normalize the data (linearly) to the interval [0..1]; otherwise the features with the largest ranges dominate the distance measure and the clustering result will be garbage!
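A minimal sketch of the linear normalization mentioned above, in plain Java. `MinMaxNormalizer` is a hypothetical name, not taken from Mahout or Commons Math. It assumes all rows have the same length, and it maps a column whose values are all equal to 0, which is one possible convention among several.

```java
import java.util.Arrays;

/** Hypothetical helper: rescales each column of a data set linearly to [0, 1]. */
class MinMaxNormalizer {

    static double[][] normalize(double[][] rows) {
        if (rows.length == 0) {
            return new double[0][];
        }
        int dims = rows[0].length; // assumption: every row has this many values
        double[] min = new double[dims];
        double[] max = new double[dims];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // First pass: find the minimum and maximum of every column.
        for (double[] row : rows) {
            for (int j = 0; j < dims; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }

        // Second pass: x' = (x - min) / (max - min), column by column.
        double[][] out = new double[rows.length][dims];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < dims; j++) {
                double range = max[j] - min[j];
                out[i][j] = range == 0.0 ? 0.0 : (rows[i][j] - min[j]) / range;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] scaled = normalize(new double[][] {{0, 10}, {5, 10}, {10, 10}});
        System.out.println(Arrays.deepToString(scaled));
    }
}
```

The row order is unchanged, so the IDs can be kept in a parallel list and reattached after scaling.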

score 0 · Answer 3 · answered Apr 02 '12 at 10:08

The clusteredPoints directory produced by k-means contains this mapping. Note that you must run k-means with the -cl option to get this data.

Hossein
