I have a data set with approximately 1 billion data points, of which about 46 million are unique values that I want to extract.
I want to use Hadoop to extract the unique values, but I keep getting "Out of Memory" and Java heap size errors on Hadoop. At the same time, I am able to run this fairly easily on a single box using a Python set (a hash table, if you will).
I am using a fairly simple algorithm to extract these unique values: I parse the 1 billion lines in my mapper and output lines that look like this:
UniqValueCount:I a
UniqValueCount:I a
UniqValueCount:I b
UniqValueCount:I c
UniqValueCount:I c
UniqValueCount:I d
and then running the "aggregate" reducer to get the results, which should look like this for the above data set:
I 4
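For illustration, a bare-bones Hadoop Streaming mapper that emits records in this format could look like the sketch below (the field parsing is just a placeholder, not my actual extraction logic):

#!/usr/bin/env python
# Streaming mapper sketch: emit one UniqValueCount record per input line.
# Assumption for illustration only: the value of interest is the first tab-separated field.
import sys

for line in sys.stdin:
    value = line.rstrip("\n").split("\t")[0]
    # The "aggregate" reducer expects records of the form "UniqValueCount:<id>\t<value>"
    print("UniqValueCount:I\t%s" % value)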
This works well for a small set of values, but when I run it on the 1 billion data points (which, as I mentioned, contain 46 million unique values), the job fails.
I'm running this on Amazon's Elastic MapReduce, and even with six m2.4xlarge nodes (their highest-memory instance type, with 68.4 GB each), the job fails with "out of memory" errors.
But I am able to extract the unique values using Python code with a set data structure (a hash table) on a single m1.large (a much smaller box with 8 GB of memory). I am confused as to why the Hadoop job fails, since 46 million unique values should not take up that much memory.
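For comparison, the single-box Python approach is essentially the following sketch (the file name and field index are placeholders, not my real input):

# Single-machine dedup sketch: stream the input once and collect values in a set.
unique_values = set()
with open("data.tsv") as f:
    for line in f:
        unique_values.add(line.rstrip("\n").split("\t")[0])
print(len(unique_values))  # roughly 46 million for my data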
What could be going wrong? Am I using UniqValueCount incorrectly?