
I have written a simple k-means clustering job for Hadoop (two separate programs: a mapper and a reducer). The code works on a small dataset of 2-d points on my local box. It is written in Python, and I plan to use the Streaming API.

After each run of the mapper and reducer, new centres are generated; these centres are the input for the next iteration.
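For reference, a minimal local sketch of what such a streaming mapper and reducer could look like (the `x,y` point format and all helper names here are assumptions, not the code from the question):

```python
def load_centres(path):
    """Read the current centres, one per line: 'x,y'."""
    with open(path) as f:
        return [tuple(map(float, line.split(","))) for line in f if line.strip()]

def nearest(point, centres):
    """Index of the centre closest to point (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda i: (point[0] - centres[i][0]) ** 2 +
                             (point[1] - centres[i][1]) ** 2)

def mapper(lines, centres):
    """Emit (centre_index, point) pairs, one per input point."""
    for line in lines:
        point = tuple(map(float, line.split(",")))
        yield nearest(point, centres), point

def reducer(pairs):
    """Average the points of each cluster to get the new centres."""
    sums = {}
    for idx, (x, y) in pairs:
        sx, sy, n = sums.get(idx, (0.0, 0.0, 0))
        sums[idx] = (sx + x, sy + y, n + 1)
    return {idx: (sx / n, sy / n) for idx, (sx, sy, n) in sums.items()}
```

Under Streaming, the mapper would read the centres file from the distributed cache and the reducer's output would become the centres file for the next run.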

Based on the suggestions, I used mrjob, which is suitable for multi-step jobs:

def steps(self):
    return [self.mr(mapper=self.anything,
                    combiner=self.anything,
                    reducer=self.anything)]

This is just one iteration. Please tell me a way to feed the output back to the mapper after the new centres are generated. What I mean is: at the last step (the reducer) the new centres are generated, and now they need to be fed back to the mapper (the first step) to calculate new distances with the new centres, and so on until convergence is reached.
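Since a single MapReduce pass has no loop construct, the usual pattern is an outer driver loop that reruns the job until the centres stop moving. A minimal local sketch of that control flow (all names here are hypothetical; it assumes every centre keeps at least one point):

```python
def new_centres(points, centres):
    """One k-means iteration: assign each point, then average each cluster."""
    clusters = {}
    for p in points:
        idx = min(range(len(centres)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])))
        clusters.setdefault(idx, []).append(p)
    return [tuple(sum(c) / len(pts) for c in zip(*pts))
            for idx, pts in sorted(clusters.items())]

def run_until_converged(points, centres, tol=1e-6, max_iter=100):
    """Feed each iteration's output back in as the next iteration's input."""
    for _ in range(max_iter):
        updated = new_centres(points, centres)
        # Largest coordinate shift of any centre in this iteration.
        shift = max(max(abs(a - b) for a, b in zip(c1, c2))
                    for c1, c2 in zip(centres, updated))
        centres = updated
        if shift < tol:
            break
    return centres
```

In the Hadoop setting, the body of the loop would be one job submission and the convergence check would compare the reducer's output file against the previous centres file.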

(Please do not tell me about Mahout, Spark, or any other implementation; I am aware of them.)

Amin Mohebi
    There is a great walk-through (with code) for doing exactly this I saw the other week: http://www.classes.cs.uchicago.edu/archive/2013/spring/12300-1/labs/lab3/ – David Manheim Jun 24 '14 at 20:14

1 Answer


To stop execution when running k-means, we normally define either a number of iterations or a threshold distance. Here you may want to write a chain of MapReduce jobs for that number of iterations: write the cluster centroids produced by each reducer to a temp file and feed that file to the next mapper. Repeat until you reach your threshold.
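A minimal local simulation of that chain, writing the centroids to a temp file between iterations (only the file handling is shown; `one_iteration` is a hypothetical stand-in for submitting one mapper+reducer pass):

```python
import os
import tempfile

def write_centres(path, centres):
    """Write centres one per line as 'x,y' (what the reducer would emit)."""
    with open(path, "w") as f:
        for x, y in centres:
            f.write("%f,%f\n" % (x, y))

def read_centres(path):
    """Parse the centres file back into (x, y) tuples."""
    with open(path) as f:
        return [tuple(map(float, line.split(","))) for line in f if line.strip()]

def one_iteration(points, centres):
    """Stand-in for one mapper+reducer pass: assign points, then average."""
    clusters = {}
    for p in points:
        idx = min(range(len(centres)),
                  key=lambda i: (p[0] - centres[i][0]) ** 2 +
                                (p[1] - centres[i][1]) ** 2)
        clusters.setdefault(idx, []).append(p)
    return [(sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
            for _, pts in sorted(clusters.items())]

def chain(points, centres, n_iterations):
    """Run n_iterations passes, feeding each output file to the next pass."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        for _ in range(n_iterations):
            write_centres(path, centres)                          # reducer output
            centres = one_iteration(points, read_centres(path))   # next mapper input
    finally:
        os.remove(path)
    return centres
```

With real Hadoop jobs, the temp file would instead be an HDFS path passed to the next job via the distributed cache or a `-file` argument.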

Tanveer