Optimizing pairwise mutual information score

Asked Mar 27 '16 at 14:21

Active Mar 27 '16 at 14:53

Viewed 662 times

I am trying to compute the adjusted mutual information score between every pair of columns of a pandas DataFrame:

from itertools import combinations

import pandas as pd
from sklearn.metrics.cluster import adjusted_mutual_info_score

# `train` is an existing DataFrame; every column except "ID" is compared.
current_valid_columns = list(train.columns.difference(["ID"]))

MI_scores = pd.DataFrame(columns=["features_pair", "adjusted_mutual_information"])

# Score each unordered pair of columns and append it as a new row.
current_index = 0
for columns_pair in combinations(current_valid_columns, 2):
    score = adjusted_mutual_info_score(train[columns_pair[0]], train[columns_pair[1]])
    MI_scores.loc[current_index] = [str(columns_pair), score]
    current_index += 1

MI_scores.to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)

This works, but it's very slow on a dataframe with a large number of columns. How can I optimize it?

edited Mar 27 '16 at 14:53

asked Mar 27 '16 at 14:21

Mohamed Ali JAMAOUI


See [this question](http://stackoverflow.com/questions/20491028/optimal-way-to-compute-pairwise-mutual-information-using-numpy). – Ami Tavory Mar 27 '16 at 14:59
@AmiTavory It's almost the same from a performance standpoint: nested for loops (done here with itertools.combinations) plus a function call to compute the metric for each pair. – Mohamed Ali JAMAOUI Mar 27 '16 at 15:07
I agree. My point was to show you that it's been asked, and a faster alternative wasn't found. You probably won't find one for pandas. – Ami Tavory Mar 27 '16 at 15:14
@AmiTavory Thanks, I will try to find one and will share it if I do. – Mohamed Ali JAMAOUI Mar 27 '16 at 19:17
It's an interesting question. Good luck to you. – Ami Tavory Mar 27 '16 at 19:21
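
For anyone landing here: the comments agree that the loop over pairs plus one metric call per pair is hard to avoid, but the overhead around it can be trimmed. The following is a minimal, unbenchmarked sketch, assuming joblib is installed (scikit-learn depends on it). It pulls each column out as an array once, scores the pairs in parallel, and builds the result DataFrame in one step instead of growing it with `.loc`. The helper name `pairwise_ami` and its `exclude` and `n_jobs` parameters are made up for this illustration and are not part of pandas or scikit-learn.

```python
from itertools import combinations

import pandas as pd
from joblib import Parallel, delayed
from sklearn.metrics.cluster import adjusted_mutual_info_score


def pairwise_ami(frame, exclude=("ID",), n_jobs=-1):
    """Hypothetical helper: one row per column pair with its AMI score."""
    names = [c for c in frame.columns if c not in exclude]
    # Extract each column as a NumPy array once, not once per pair.
    arrays = {name: frame[name].to_numpy() for name in names}
    pairs = list(combinations(names, 2))
    # Pairs are scored independently, so the calls can be spread over cores.
    scores = Parallel(n_jobs=n_jobs)(
        delayed(adjusted_mutual_info_score)(arrays[a], arrays[b])
        for a, b in pairs
    )
    # Build the result in a single step rather than appending row by row.
    return pd.DataFrame({
        "features_pair": [str(p) for p in pairs],
        "adjusted_mutual_information": scores,
    })
```

With the question's data this would be called as `pairwise_ami(train).to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)`. The number of metric calls is unchanged, so any speed-up comes only from the reduced bookkeeping and from however many cores `n_jobs` can use.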
