
I have thousands of tables, each of which contains hundreds of words and their corresponding scores in the second column, and I need to calculate the correlation of each pair of tables.

So, I started to read each table and convert it to a dictionary: each word is a dictionary key, and its score is the value.

Now it is time to calculate the correlations. I should mention that the dictionaries do not necessarily all have the same keys; some have more, some fewer. Each dictionary should be expanded according to its pair: if the pair has a key that does not exist in the other dictionary, the other dictionary should be updated with that key, with a value of 0, and only then should the correlation coefficient be calculated.

example:

dict1 = {'car': 0.1, 'dog': 0.3, 'tiger': 0.5, 'lion': 0.1, 'fish': 0.2}
dict2 = {'goat': 0.3, 'fish': 0.3, 'shark': 0.4, 'dog': 0.3}

So, dict1 should end up looking like:

dict1.comparable = {'car': 0.1, 'goat': 0.0, 'dog': 0.3, 'tiger': 0.5, 'lion': 0.1, 'fish': 0.2, 'shark': 0.0}
dict2.comparable = {'car': 0.0, 'goat': 0.3, 'dog': 0.3, 'fish': 0.3, 'shark': 0.4, 'tiger': 0.0, 'lion': 0.0}

and then the correlation of their values should be calculated.
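For instance, the expansion could be sketched like this (a minimal illustration of the procedure above; dict1_comparable and dict2_comparable are just illustrative names for the expanded copies):

all_keys = set(dict1) | set(dict2)
# fill in 0.0 for any key the other dictionary has but this one lacks
dict1_comparable = dict((k, dict1.get(k, 0.0)) for k in all_keys)
dict2_comparable = dict((k, dict2.get(k, 0.0)) for k in all_keys)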

I would appreciate advice on how to calculate the similarity/correlation of dictionaries based on their values efficiently.

UPDATE

Here is a post which explains how to compute the correlation coefficient technically.

Here is the simplest version:

import numpy
# list1 and list2 must be equal-length sequences of numbers
numpy.corrcoef(list1, list2)[0, 1]

but it only works on lists. Basically, I am after calculating the correlation coefficient of two dictionaries with respect to their keys, in an efficient manner (with as little expanding and sorting of keys as possible).

Areza

2 Answers

import numpy

# union of both key sets (dict.viewkeys() is Python 2.7+)
keys = list(dict1.viewkeys() | dict2.viewkeys())
numpy.corrcoef(
    [dict1.get(x, 0) for x in keys],   # missing keys count as 0
    [dict2.get(x, 0) for x in keys])[0, 1]

First you get all the keys. There is no need to sort, but de-duplication is needed. Storing the keys as a list helps to iterate over them in the same order later.

Then you can create the two lists that numpy requires.
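For example, with the dictionaries from the question (using set(...) | set(...) here so the snippet also runs on Python versions without viewkeys()):

import numpy

dict1 = {'car': 0.1, 'dog': 0.3, 'tiger': 0.5, 'lion': 0.1, 'fish': 0.2}
dict2 = {'goat': 0.3, 'fish': 0.3, 'shark': 0.4, 'dog': 0.3}

keys = list(set(dict1) | set(dict2))
print(numpy.corrcoef(
    [dict1.get(x, 0) for x in keys],
    [dict2.get(x, 0) for x in keys])[0, 1])

Both lists are built in the same key order, so corrcoef pairs up the scores correctly.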

Felipe Hoffa
  • keys = list(dict1.viewkeys() | dict2.viewkeys()) returns an error on Python 2.6. Is it the same as keys = set(dict1); keys.union(dict2)? I would appreciate it if you could update your post. – Areza May 23 '13 at 00:46
  • You are right, viewkeys() was added in Python 2.7. It makes the solution more lightweight though! `keys = set(dict1.keys()) | set(dict2.keys())` would be the 2.6 answer. http://docs.python.org/2/library/stdtypes.html#dict.viewkeys – Felipe Hoffa May 23 '13 at 01:16
  • Wow, it is interesting how Python sorts the keys (or at least lists the two dictionaries' values according to the keys in the same order). – Areza May 23 '13 at 01:51

Don't add zeros to the dictionaries. They are just bloat and would be eliminated when the similarity is calculated anyway. Leaving out the zeros will already save you some time, if not a lot.

Then, to calculate the similarity, start with the shorter of the two dictionaries. For each key in the shorter one, check whether the key is in the longer dictionary. That also saves a lot of time, because looping over a dict with N items takes O(N) time, while checking whether an item is in the larger dict takes only O(1) time on average.

Don't create the intermediate dictionaries if it is just to calculate similarity; that wastes time and memory.

To eventually calculate similarity, you can try the cosine metric, Euclidean distance, or something else, depending on your needs.
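A minimal sketch of that sparse approach, assuming the cosine metric is acceptable (the function name is illustrative, not part of the answer):

import math

def cosine_similarity(d1, d2):
    # iterate over the shorter dict; membership tests on the
    # longer one are O(1) on average
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    # keys missing from either dict contribute 0 to the dot product,
    # so they can be skipped entirely
    dot = sum(v * d2[k] for k, v in d1.items() if k in d2)
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

dict1 = {'car': 0.1, 'dog': 0.3, 'tiger': 0.5, 'lion': 0.1, 'fish': 0.2}
dict2 = {'goat': 0.3, 'fish': 0.3, 'shark': 0.4, 'dog': 0.3}
print(cosine_similarity(dict1, dict2))  # only 'dog' and 'fish' contribute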

pvoosten
  • I see your point that excluding zeros is more efficient, but it is not statistically sound! Dictionaries with fewer keys have a higher chance of showing a higher correlation! – Areza May 22 '13 at 23:31
  • 1
    Your comment is not completely correct. I wrote my answer before your update of the question, before you specified the meaning of similarity/correlation, which can be many different things, including metrics that don't require to keep zeros. Those metrics support sparse data better, and are more efficient. Besides, nothing is "statisticaly true"... There is always bias. – pvoosten May 23 '13 at 19:12