
I'm doing a co-occurrence analysis on huge web logs. Using Hadoop, I have computed the occurrence count of each item and the co-occurrence count of each pair <item1, item2>.

Now I want to compute a correlation measure for a pair <item1, item2>, such as n_12/(n_1*n_2), where n_1 and n_2 are the occurrence counts of the two items and n_12 is the co-occurrence count of the pair. I've arranged the data as:

key: item1
value: [(item1, n_1) (item2, n_12) ... (itemk, n_1k)]

How can I know n_2, ..., n_k while processing the key-value pair for item1?

Thank you for your help.

harpun
rudaoshi

1 Answer


Do you mean that you need to access a particular dictionary in each mapper? You can use Hadoop's distributed cache feature, which works for smaller dictionaries. How large can the dictionary be? If it runs into gigabytes, you might have to resort to a reduce-side join.
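To make the two options concrete, here is a minimal sketch in plain Python that only simulates the data flow; it does not use any Hadoop API. The toy counts and the names `item_counts`, `pair_rows`, `map_side_lookup` and `reduce_side_join` are made up for illustration. The first function assumes the whole count dictionary fits in memory (the distributed cache case); the second re-keys each pair record by item2 so that it meets n_2 in the same reduce group.

```python
from collections import defaultdict

# Toy data, invented for illustration only.
item_counts = {"a": 10, "b": 5, "c": 4}   # item -> n_i
pair_rows = {                             # item1 -> [(item1, n_1), (item2, n_12), ...]
    "a": [("a", 10), ("b", 2), ("c", 1)],
    "b": [("b", 5), ("c", 3)],
}

def score(n_12, n_1, n_2):
    """The measure from the question: n_12 / (n_1 * n_2)."""
    return n_12 / (n_1 * n_2)

def map_side_lookup(pair_rows, item_counts):
    """Distributed-cache analogue: each mapper holds item_counts in memory."""
    out = {}
    for item1, row in pair_rows.items():
        n_1 = row[0][1]                   # first entry of the row is (item1, n_1)
        for item2, n_12 in row[1:]:
            out[(item1, item2)] = score(n_12, n_1, item_counts[item2])
    return out

def reduce_side_join(pair_rows, item_counts):
    """Reduce-side join analogue: group count and pair records by item2."""
    groups = defaultdict(list)            # stands in for the shuffle phase
    for item, n in item_counts.items():
        groups[item].append(("count", n))
    for item1, row in pair_rows.items():
        n_1 = row[0][1]
        for item2, n_12 in row[1:]:
            groups[item2].append(("pair", item1, n_1, n_12))

    out = {}
    for item2, records in groups.items():  # one iteration per reduce key
        n_2 = next(r[1] for r in records if r[0] == "count")
        for r in records:
            if r[0] == "pair":
                _, item1, n_1, n_12 = r
                out[(item1, item2)] = score(n_12, n_1, n_2)
    return out

if __name__ == "__main__":
    print(map_side_lookup(pair_rows, item_counts))
    print(reduce_side_join(pair_rows, item_counts))
```

Both functions produce the same scores. The difference is where n_2 comes from: a local lookup in the first case, the grouped records in the second, which is why the second still works when the dictionary is too large to hold in memory.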

Bill the Lizard