I want to implement a Hadoop reducer for word counting. In my reducer I use a hash table to count the words, but if my file is extremely large the hash table will use an extreme amount of memory. How can I address this issue? (E.g. for a file with 10 million lines, each reducer receives 100 million words, so a hash table would require 100 million keys.) My current implementation is in Python. Is there a smart way to reduce the amount of memory?
-
Am I missing something? If you're just looking to count words, you don't need a hash table 100 million entries long, as you're going to get a lot of repetition. For instance, you might get 250k occurrences of the word `the`. Can't you just stream the data line by line through a function that increments a `collections.Counter`? (See the sketch after this comment thread.) – kreativitea Dec 01 '12 at 20:32
-
Sorry, my mistake. I wanted to say that for 100 million distinct words, for example, I need 100 million entries in a hash table. – nikosdi Dec 01 '12 at 20:35
-
100 million distinct words, really? http://oxforddictionaries.com/words/how-many-words-are-there-in-the-english-language – Chris White Dec 01 '12 at 20:37
-
OK, of course, maybe it's not realistic; I am just looking for a better method! :P (In my case the keys aren't actually real words, and I really do have a memory issue.) – nikosdi Dec 01 '12 at 20:38
-
@nikosdi There are simply not that many words in the English language, or even in every language combined. Unless you're not from this planet, you're not going to need that many hash entries. – kreativitea Dec 01 '12 at 20:38
-
OK, I understand! But if I wanted to reduce the memory usage, how could this be implemented? – nikosdi Dec 01 '12 at 20:40
-
@nikosdi You can use stemming of some kind to merge similar entries in your hash. Look at NLTK for a built-in stemmer. – kreativitea Dec 01 '12 at 20:43
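As an aside, here is a minimal sketch of the line-by-line `collections.Counter` approach suggested in the first comment (plain Python reading stdin, outside Hadoop; the tab-separated output format is just an illustrative choice):

```python
import sys
from collections import Counter

def count_words(lines):
    """Stream lines one at a time; memory grows only with the
    number of distinct words, not the total number of words."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    # Print words in descending order of frequency, one per line.
    for word, count in count_words(sys.stdin).most_common():
        print("%s\t%d" % (word, count))
```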
1 Answer
The most efficient way to do this is to maintain a hash map of word frequency in your mappers, and flush them to the output context when they reach a certain size (say 100,000 entries). Then clear out the map and continue (remember to flush the map in the cleanup method too).
If you still truly have hundreds of millions of distinct words, then you'll either need to wait a long time for the reducers to finish, or increase your cluster size and use more reducers.
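A hedged Python sketch of that mapper-side flushing idea, assuming Hadoop Streaming with tab-separated (word, count) output; the 100,000-entry threshold is just the example figure from the answer:

```python
import sys
from collections import Counter

MAX_ENTRIES = 100000  # flush threshold; tune to your available memory

def flush(counts):
    """Emit partial (word, count) pairs and clear the in-memory map."""
    for word, count in counts.items():
        sys.stdout.write("%s\t%d\n" % (word, count))
    counts.clear()

def main():
    counts = Counter()
    for line in sys.stdin:
        counts.update(line.split())
        if len(counts) >= MAX_ENTRIES:
            flush(counts)
    flush(counts)  # the "cleanup" step: flush whatever is left at the end

if __name__ == "__main__":
    main()
```

The reducers then simply sum the partial counts they receive for each word, so the mapper-side map never has to hold every distinct word at once.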

Chris White
-
OK, I am using a combiner in the mapper at the moment, but I still have this problem: if my reducer receives more distinct words than can fit in memory, it won't continue... OK, I got it! This is the part of the answer about buying a bigger Hadoop cluster! ;) – nikosdi Dec 01 '12 at 20:45
-
By the time you get to the reducer, you don't need to hold everything in memory - surely you just add up the counts for each key and output. Why do you need to maintain a hashmap in the reducer? – Chris White Dec 01 '12 at 22:16
-
A reducer can receive word A from mappers A1 and A2, so the reducer must also perform a count... A1 sends {N, 100}, A2 sends {N, 100}... – nikosdi Dec 01 '12 at 22:38
-
You don't need a hashmap in the reducer. In your mapper you output (word, 1) every time a word occurs. Hadoop's shuffle & sort ensures that each reduce call receives as values all the "1"s for each unique "word". You simply iterate over the Iterable, add the "1"s up, emit the total, and you are done. – anonymous1fsdfds Dec 02 '12 at 10:33
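To illustrate that last comment in the asker's language, a rough sketch of a Hadoop Streaming reducer in Python, assuming the usual tab-separated (word, count) lines from the mappers. Because the shuffle & sort groups identical words together, the reducer only ever keeps the current word and a running total in memory, with no hash table at all:

```python
import sys

def main():
    current_word = None
    total = 0
    # Streaming delivers input sorted by key, so all counts for a given
    # word arrive consecutively; we only track one word at a time.
    for line in sys.stdin:
        word, _, count = line.rstrip("\n").partition("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, total))
            current_word = word
            total = 0
        total += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, total))

if __name__ == "__main__":
    main()
```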