I have thousands of tables, each containing hundreds of words and their corresponding scores in the second column, and I need to calculate the correlation of each pair of tables.
So I started by reading each table and converting it to a dictionary: each word is a dictionary key, and its score is the value.
Now it is time to calculate the correlations. I should mention that the dictionaries do not necessarily all have the same keys; some have more, some fewer. Each dictionary should be expanded according to its pair: if one dictionary has a key that does not exist in the other, the other dictionary should be updated with that key, its value set to 0, and only then should the correlation coefficient be calculated.
example:
dict1 = {'car': 0.1, 'dog':0.3, 'tiger':0.5, 'lion': 0.1, 'fish':0.2}
dict2 = {'goat':0.3, 'fish':0.3, 'shark':0.4, 'dog':0.3}
so the expanded dictionaries should look like:
dict1_comparable = {'car': 0.1, 'goat': 0.0, 'dog': 0.3, 'tiger': 0.5, 'lion': 0.1, 'fish': 0.2, 'shark': 0.0}
dict2_comparable = {'car': 0.0, 'goat': 0.3, 'dog': 0.3, 'fish': 0.3, 'shark': 0.4, 'tiger': 0.0, 'lion': 0.0}
and then the correlation of their values should be calculated.
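A minimal sketch of the approach described above (the function name `dict_correlation` is my own): take the union of the two key sets, fill missing entries with 0.0, and hand the aligned value lists to `numpy.corrcoef`.

```python
import numpy as np

def dict_correlation(d1, d2):
    # Union of both key sets; missing words default to 0.0, as described above
    keys = sorted(d1.keys() | d2.keys())
    v1 = [d1.get(k, 0.0) for k in keys]
    v2 = [d2.get(k, 0.0) for k in keys]
    # corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
    return np.corrcoef(v1, v2)[0, 1]

dict1 = {'car': 0.1, 'dog': 0.3, 'tiger': 0.5, 'lion': 0.1, 'fish': 0.2}
dict2 = {'goat': 0.3, 'fish': 0.3, 'shark': 0.4, 'dog': 0.3}
r = dict_correlation(dict1, dict2)
```

Note the expanded dictionaries are never materialized; `dict.get(k, 0.0)` supplies the implicit zeros while building the value lists.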
I would appreciate advice on how to calculate the similarity/correlation of dictionaries based on their values efficiently.
UPDATE
Here is a post which explains how to compute the correlation coefficient technically.
Here is the simplest version:
import numpy
numpy.corrcoef(list1, list2)[0, 1]
but it only works on lists. Basically, I am after calculating the correlation coefficient of two dictionaries with respect to their keys, in an efficient manner (with less expanding and sorting of keys).
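Since there are thousands of tables, one way to avoid expanding and sorting keys per pair is to build a shared vocabulary once and pack all scores into a single matrix, then let `numpy.corrcoef` compute every pairwise coefficient in one call. This is a sketch under that assumption (the function name `correlation_matrix` is hypothetical):

```python
import numpy as np

def correlation_matrix(dicts):
    # Build the shared vocabulary once, instead of expanding each pair of dicts
    vocab = sorted(set().union(*dicts))
    index = {word: i for i, word in enumerate(vocab)}
    # One row per table; absent words stay 0.0
    mat = np.zeros((len(dicts), len(vocab)))
    for row, d in enumerate(dicts):
        for word, score in d.items():
            mat[row, index[word]] = score
    # corrcoef treats each row as a variable and returns all pairwise correlations
    return np.corrcoef(mat)
```

The returned matrix has the correlation between table `i` and table `j` at entry `[i, j]`, so each pair is computed exactly once with no per-pair key manipulation.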