I have written a simple k-means clustering implementation for Hadoop as two separate programs, a mapper and a reducer. The code works on a small dataset of 2D points on my local box; it is written in Python, and I plan to run it with the Streaming API.
After each run of the mapper and reducer, new centers are generated, and these centers are the input for the next iteration.
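For reference, the streaming mapper looks roughly like this (a sketch: the centers.txt file name and the one-point-per-line input format are my simplifications; the reducer, not shown, averages the points assigned to each center id to produce the new centers):

    #!/usr/bin/env python
    # mapper.py -- assignment step of one k-means iteration (sketch)
    import sys

    # "centers.txt" is a placeholder name; the file would be shipped
    # to the task nodes with the streaming -file option
    with open("centers.txt") as f:
        centers = [tuple(map(float, line.split())) for line in f]

    for line in sys.stdin:
        x, y = map(float, line.split())
        # nearest center by squared Euclidean distance
        nearest = min(range(len(centers)),
                      key=lambda i: (x - centers[i][0]) ** 2
                                  + (y - centers[i][1]) ** 2)
        # emit: center id as key, the point as value
        print("%d\t%f\t%f" % (nearest, x, y))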
Based on the suggestions, I now use the Python mrjob library (mrjob.job), which supports multi-step jobs:
    def steps(self):
        return [self.mr(mapper=self.anything,
                        combiner=self.anything,
                        reducer=self.anything)]
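Fleshed out, my one-iteration job looks roughly like this (a sketch: MRKMeans, load_centers, and the centers.txt path are illustrative names of mine, not part of mrjob, and shipping the centers file to the task nodes is left out; I keep the self.mr style from the snippet above, as in older mrjob versions):

    from mrjob.job import MRJob

    def load_centers(path="centers.txt"):
        # placeholder: in a real run the file must be shipped to each task;
        # format: one "x y" pair per line
        with open(path) as f:
            return [tuple(map(float, line.split())) for line in f]

    class MRKMeans(MRJob):

        def steps(self):
            return [self.mr(mapper=self.assign_mapper,
                            reducer=self.update_reducer)]

        def assign_mapper(self, _, line):
            # assignment step: emit (nearest center id, point);
            # loaded per line for brevity, a mapper_init could cache it
            centers = load_centers()
            x, y = map(float, line.split())
            nearest = min(range(len(centers)),
                          key=lambda i: (x - centers[i][0]) ** 2
                                      + (y - centers[i][1]) ** 2)
            yield nearest, (x, y)

        def update_reducer(self, center_id, points):
            # update step: the new center is the mean of its assigned points
            n, sx, sy = 0, 0.0, 0.0
            for x, y in points:
                n += 1
                sx += x
                sy += y
            yield center_id, (sx / n, sy / n)

    if __name__ == '__main__':
        MRKMeans.run()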
This is just one iteration. Is there any way to feed the output back into the mapper after the new centers are generated? In other words, after the last step (the reducer) produces the new centers, I need to feed them back to the mapper (the first step) so it can compute distances against the new centers, and repeat until convergence.
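To make the question concrete, what I am trying to achieve is equivalent to this outer driver loop (a sketch: MRKMeans, mr_kmeans, points.txt, centers.txt, and the tolerance EPS are my placeholder names; make_runner, stream_output, and parse_output_line are from the mrjob runner docs):

    # driver.py -- repeat the job until the centers stop moving (sketch)
    from mr_kmeans import MRKMeans   # hypothetical module holding the job class

    EPS = 1e-4   # convergence tolerance (assumption)

    def read_centers(path="centers.txt"):
        with open(path) as f:
            return [tuple(map(float, line.split())) for line in f]

    def write_centers(centers, path="centers.txt"):
        with open(path, "w") as f:
            for x, y in centers:
                f.write("%f %f\n" % (x, y))

    def converged(old, new, eps=EPS):
        # stop when no center moved more than eps in either coordinate
        return all(abs(ox - nx) <= eps and abs(oy - ny) <= eps
                   for (ox, oy), (nx, ny) in zip(old, new))

    while True:
        old = read_centers()
        job = MRKMeans(args=["points.txt"])
        with job.make_runner() as runner:
            runner.run()
            new = list(old)   # keep the old center if a cluster gets no points
            for line in runner.stream_output():
                center_id, (x, y) = job.parse_output_line(line)
                new[center_id] = (x, y)
        write_centers(new)    # the next run's mapper picks up the new centers
        if converged(old, new):
            break

So the loop and the convergence test live outside the job; what I want to know is whether there is a cleaner way to do this feedback within mrjob itself.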
(Please do not tell me about Mahout, Spark, or any other implementation; I am aware of them.)