How can I globally access a huge dictionary in each mapper of a Hadoop MapReduce program?

Question

I'm doing a co-occurrence analysis on huge web logs. Using Hadoop, I have computed the occurrence count of each item and the co-occurrence count of each pair <item1, item2>.

Now I want to compute a correlation measure for each pair <item1, item2>, such as n_12/(n_1*n_2), where n_1 and n_2 are the occurrence counts of item1 and item2 and n_12 is their co-occurrence count. I've arranged the data as:

key: item1
value: [(item1, n_1) (item2, n_12) ... (itemk, n_1k)]

How can I get n_2, ..., n_k while processing the key-value record for item1?

Thank you for your help.

score 2 · Answer 1 · edited May 17 '13 at 12:28

Do you mean that you need to access a particular dictionary in each mapper? You can use Hadoop's distributed cache feature, which works for smaller dictionaries. How large is the dictionary? If it is in the GB range, you may have to resort to a reduce-side join.
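
For the small-dictionary case, the idea can be sketched in plain Python as a local simulation. This does not call the Hadoop API: the `item<TAB>count` side-file format and the names `load_counts` and `map_record` are assumptions made for illustration. The point is only that each mapper loads the dictionary once and then looks up n_2, ..., n_k in memory.

```python
import io


def load_counts(lines):
    """Parse 'item<TAB>count' lines into a dict (done once per mapper)."""
    counts = {}
    for line in lines:
        item, n = line.rstrip("\n").split("\t")
        counts[item] = int(n)
    return counts


def map_record(key, pairs, counts):
    """Yield ((item1, item2), n_12 / (n_1 * n_2)) for one input record."""
    n_1 = counts[key]
    for other, n_12 in pairs:
        if other == key:
            continue  # skip the (item1, n_1) entry itself
        yield (key, other), n_12 / (n_1 * counts[other])


# Hypothetical side file, standing in for a file shipped to every mapper.
side_file = io.StringIO("a\t4\nb\t5\nc\t10\n")
counts = load_counts(side_file)

# One record in the layout from the question: key a, value [(a, 4), (b, 2), (c, 1)].
scores = dict(map_record("a", [("a", 4), ("b", 2), ("c", 1)], counts))
```

This only works while the whole dictionary fits in each mapper's memory.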

edited May 17 '13 at 12:28 by Bill the Lizard

answered Mar 08 '13 at 09:44 by Eswara Reddy Adapa

Thank you for your answer! The dictionary is in the GB range. A join is just what I need. – rudaoshi Mar 09 '13 at 16:52
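
For a dictionary in the GB range, a reduce-side join could look roughly like the following local simulation in plain Python. It is a sketch under assumptions, not Hadoop code: the record tags `"C"` and `"P"` and the function names are hypothetical. Each pair record is re-keyed by item2 so that it reaches the same reducer as the count n_2; n_1 travels along with the pair because it is already present in the item1 record.

```python
from collections import defaultdict


def map_counts(item, n):
    # Tag count records so the reducer can tell them apart from pair records.
    yield item, ("C", n)


def map_pairs(item1, entries):
    # entries is [(item1, n_1), (item2, n_12), ...] as in the question.
    n_1 = dict(entries)[item1]
    for item2, n_12 in entries:
        if item2 != item1:
            # Re-key by item2 so this record meets n_2 in the reducer.
            yield item2, ("P", item1, n_1, n_12)


def reduce_join(item2, values):
    values = list(values)  # buffered here for simplicity
    n_2 = next(v[1] for v in values if v[0] == "C")
    for v in values:
        if v[0] == "P":
            _, item1, n_1, n_12 = v
            yield (item1, item2), n_12 / (n_1 * n_2)


def run_job(counts, records):
    """Simulate the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for item, n in counts.items():
        for k, v in map_counts(item, n):
            groups[k].append(v)
    for item1, entries in records.items():
        for k, v in map_pairs(item1, entries):
            groups[k].append(v)
    out = {}
    for k in sorted(groups):
        # Keys that only have a count record produce no output.
        out.update(reduce_join(k, groups[k]))
    return out


counts = {"a": 4, "b": 5, "c": 10}
records = {"a": [("a", 4), ("b", 2), ("c", 1)]}
result = run_job(counts, records)
```

In a real job the reducer would avoid buffering all values for a key, for example by sorting so that the count record arrives first.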
