I'm doing a co-occurrence analysis on huge web logs. Using Hadoop, I have computed the occurrence count of each item and the co-occurrence count of each pair <item1, item2>.
Now I want to compute a correlation measure for each pair <item1, item2>, such as n_12/(n_1*n_2), where n_i is the occurrence count of item i and n_ij is the co-occurrence count of the pair <item_i, item_j>. I've arranged the data as:
key: item1
value: [(item1, n_1) (item2, n_12) ... (itemk, n_1k)]
How can I get n_2, ..., n_k while processing the key-value pair for item1?
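To make the problem concrete, here is a minimal Python sketch of one record in the layout above. The item names and counts are made up for illustration, and `split_record` and `correlation` are hypothetical helper names, not part of my actual job:

```python
# Hypothetical record for key "item1", mirroring the layout above.
# The first entry carries n_1; the others carry co-occurrence counts n_1j.
key = "item1"
value = [("item1", 100), ("item2", 20), ("item3", 5)]


def split_record(key, value):
    """Separate the key's own count n_1 from its co-occurrence counts n_1j."""
    counts = dict(value)
    n_self = counts.pop(key)
    return n_self, counts


def correlation(n_12, n_1, n_2):
    """The measure from the question: n_12 / (n_1 * n_2)."""
    return n_12 / (n_1 * n_2)


n_1, co_counts = split_record(key, value)

# n_1 and every n_1j are available in this record, but n_2, ..., n_k are not:
# they are stored in the records keyed by item2, ..., itemk.
items_needing_counts = sorted(co_counts)
```

With this record alone, `correlation` cannot be called for any pair, because its third argument has to come from a different key's record.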
Thank you for your help.