
I am trying to compute the adjusted mutual information score between every pair of columns of a pandas DataFrame:

import pandas as pd
from itertools import combinations
from sklearn.metrics.cluster import adjusted_mutual_info_score

# train is an existing DataFrame; every column except the identifier is scored
current_valid_columns = list(train.columns.difference(["ID"]))

MI_scores = pd.DataFrame(columns=["features_pair", "adjusted_mutual_information"])

# Score each unordered pair of columns and append one row per pair
current_index = 0
for columns_pair in combinations(current_valid_columns, 2):
    score = adjusted_mutual_info_score(train[columns_pair[0]], train[columns_pair[1]])
    MI_scores.loc[current_index] = [str(columns_pair), score]
    current_index += 1

MI_scores.to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)

This works, but it's very slow on a dataframe with a large number of columns. How can I optimize it?
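
For reference, here is a minimal sketch of one direction that might help. It is not a measured fix. It assumes the columns hold discrete labels, encodes each column to integer codes once with `pd.factorize`, and collects results in a plain list so the DataFrame is built a single time instead of growing row by row through `.loc`. The function name `pairwise_ami` and its `exclude` parameter are made up for illustration. The score computation itself may still dominate the runtime, so the gain could be modest.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import adjusted_mutual_info_score


def pairwise_ami(df, exclude=("ID",)):
    """Adjusted mutual information for every unordered pair of columns.

    Hypothetical helper: the name and the `exclude` parameter are
    illustrative assumptions, not part of pandas or scikit-learn.
    """
    # Same column selection as the original loop (sorted, minus excluded names).
    columns = list(df.columns.difference(list(exclude)))

    # Encode each column as integer labels once, rather than once per pair.
    # Note: pd.factorize maps missing values to -1, so NaN acts as its own label.
    codes = {c: pd.factorize(df[c])[0] for c in columns}

    # Accumulate rows in a list; the DataFrame is created once at the end.
    rows = [
        (str((a, b)), adjusted_mutual_info_score(codes[a], codes[b]))
        for a, b in combinations(columns, 2)
    ]
    return pd.DataFrame(
        rows, columns=["features_pair", "adjusted_mutual_information"]
    )
```

It would be called as `pairwise_ami(train).to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)`. Since the pairs are independent of each other, the list comprehension is also the natural place to try parallelising if the scoring turns out to be the bottleneck.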

Mohamed Ali JAMAOUI
