I am trying to compute the adjusted mutual information score between every pair of columns in a pandas DataFrame:
import pandas as pd
from sklearn.metrics.cluster import adjusted_mutual_info_score
from itertools import combinations
current_valid_columns = list(train.columns.difference(["ID"]))
MI_scores = pd.DataFrame(columns=["features_pair","adjusted_mutual_information"])
current_index = 0
for columns_pair in combinations(current_valid_columns, 2):
    row = pd.Series([str(columns_pair), adjusted_mutual_info_score(train[columns_pair[0]], train[columns_pair[1]])])
    MI_scores.loc[current_index] = row.values
    current_index += 1
MI_scores.to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)
This works, but it's very slow on a dataframe with a large number of columns. How can I optimize it?
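
For reference, here is a minimal sketch of one direction that might help. It assumes part of the cost comes from growing the DataFrame one row at a time with `.loc` and from re-encoding each column on every pair. The function name `pairwise_ami` and its `exclude` parameter are hypothetical, introduced only for illustration. Note that `pd.factorize` maps missing values to `-1`, so they would be treated as a label of their own, which may differ from the behavior of the loop above. I have not measured how much of the total time is spent inside `adjusted_mutual_info_score` itself, so this may not be enough on its own.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import adjusted_mutual_info_score


def pairwise_ami(df, exclude=("ID",)):
    """Return one row per column pair with its adjusted mutual information.

    Hypothetical helper: the name and the `exclude` parameter are assumptions.
    """
    columns = list(df.columns.difference(list(exclude)))
    # Encode every column as integer labels once, instead of once per pair.
    # Assumption: missing values become the label -1 via pd.factorize.
    codes = {col: pd.factorize(df[col])[0] for col in columns}
    rows = [
        (str((a, b)), adjusted_mutual_info_score(codes[a], codes[b]))
        for a, b in combinations(columns, 2)
    ]
    # Build the result in a single step rather than enlarging it row by row.
    return pd.DataFrame(
        rows, columns=["features_pair", "adjusted_mutual_information"]
    )
```

With this sketch, the export step would become `pairwise_ami(train).to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)`. The number of pairs still grows quadratically with the number of columns, so the per-pair score computation remains the part to look at next.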