One way to address this problem is to use a naive Bayes classifier with the feature probabilities modelled as Bernoulli distributions. This assumes that the input variables are not categorical, as you mention in the question, but simply binary. I think that's a more reasonable assumption, and it seems to follow from the construction of your input data, where the input variables appear to be binary.
A first pass at a model could be the following (adapting the important_features
function from this answer):
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
def important_features(classifier, feature_names, n=20):
    # Print the n features with the highest log probability for each class
    class_labels = classifier.classes_
    for i, class_label in enumerate(class_labels):
        print("Important features in", class_label)
        topn_class = sorted(zip(classifier.feature_log_prob_[i], feature_names),
                            reverse=True)[:n]
        for coef, feat in topn_class:
            print(coef, feat)
        print('-----------------------')
d = {}
d['fever'] = np.array([0,0,0,1,0,0,1])
d['headache'] = np.array([0,0,1,0,0,1,0])
d['sorethroat'] = np.array([1,0,0,0,1,1,0])
d['drowsiness'] = np.array([0,1,0,1,1,0,0])
d['disease'] = ['Fungal infection','Fungal infection','liver infection',
'diarrhoea','common cold','diarrhoea','flu']
df = pd.DataFrame(d)
X = df[df.columns[:-1]]
y = df['disease']
clf = BernoulliNB()
clf.fit(X, y)
important_features(clf,df.columns[:-1])
This should give you the following output, which of course is just for demonstration purposes, since I only used the small sample of data you provided above:
Important features in Fungal infection
-0.6931471805599453 sorethroat
-0.6931471805599453 drowsiness
-1.3862943611198906 headache
-1.3862943611198906 fever
-----------------------
Important features in common cold
-0.4054651081081645 sorethroat
-0.4054651081081645 drowsiness
-1.0986122886681098 headache
-1.0986122886681098 fever
-----------------------
Important features in diarrhoea
-0.6931471805599453 sorethroat
-0.6931471805599453 headache
-0.6931471805599453 fever
-0.6931471805599453 drowsiness
-----------------------
Important features in flu
-0.4054651081081645 fever
-1.0986122886681098 sorethroat
-1.0986122886681098 headache
-1.0986122886681098 drowsiness
-----------------------
Important features in liver infection
-0.4054651081081645 headache
-1.0986122886681098 sorethroat
-1.0986122886681098 fever
-1.0986122886681098 drowsiness
-----------------------
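Once fitted, the same classifier can also predict the most likely disease for a new combination of symptoms. Here is a minimal sketch on the toy data above; the "new patient" row is a made-up example for illustration, not part of your data:

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Same toy data as above
d = {}
d['fever'] = np.array([0, 0, 0, 1, 0, 0, 1])
d['headache'] = np.array([0, 0, 1, 0, 0, 1, 0])
d['sorethroat'] = np.array([1, 0, 0, 0, 1, 1, 0])
d['drowsiness'] = np.array([0, 1, 0, 1, 1, 0, 0])
d['disease'] = ['Fungal infection', 'Fungal infection', 'liver infection',
                'diarrhoea', 'common cold', 'diarrhoea', 'flu']
df = pd.DataFrame(d)
X, y = df[df.columns[:-1]], df['disease']

clf = BernoulliNB()
clf.fit(X, y)

# Hypothetical new patient with fever only
new_patient = pd.DataFrame([[1, 0, 0, 0]], columns=X.columns)
print(clf.predict(new_patient))        # → ['flu']
print(clf.predict_proba(new_patient))  # one probability per class, in clf.classes_ order
```

predict_proba is often more useful than predict here, since with so little data the probability mass is spread across several diseases.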
Naive Bayes of course doesn't account for correlations between the independent variables: for example, one could be more likely to have a headache if they have a fever, independently of the underlying disease. If this limitation is not an issue for you, then you can go ahead and run the model on all of your data. Note that training a more general model that estimates all the possible correlations from the data would probably be very difficult.
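To make the independence assumption concrete: BernoulliNB scores each class by adding up one Bernoulli log probability per feature plus the class log prior, with no interaction terms between features. The sketch below (rebuilding the same toy model so it is self-contained) reconstructs that factorised score by hand and checks it against predict_proba:

```python
import numpy as np
import pandas as pd
from scipy.special import logsumexp
from sklearn.naive_bayes import BernoulliNB

# Same toy data as above
d = {'fever': [0, 0, 0, 1, 0, 0, 1],
     'headache': [0, 0, 1, 0, 0, 1, 0],
     'sorethroat': [1, 0, 0, 0, 1, 1, 0],
     'drowsiness': [0, 1, 0, 1, 1, 0, 0],
     'disease': ['Fungal infection', 'Fungal infection', 'liver infection',
                 'diarrhoea', 'common cold', 'diarrhoea', 'flu']}
df = pd.DataFrame(d)
X, y = df[df.columns[:-1]].to_numpy(), df['disease']
clf = BernoulliNB().fit(X, y)

# Joint log likelihood: class prior plus an independent Bernoulli term per feature
log_p = clf.feature_log_prob_         # log P(symptom present | class)
log_q = np.log(1 - np.exp(log_p))     # log P(symptom absent | class)
jll = clf.class_log_prior_ + X @ log_p.T + (1 - X) @ log_q.T

# Normalising the factorised scores reproduces the model's probabilities
proba = np.exp(jll - logsumexp(jll, axis=1, keepdims=True))
print(np.allclose(proba, clf.predict_proba(X)))  # → True
```

Every term in jll involves a single feature, which is exactly why cross-feature effects like "headache given fever" cannot be captured.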
Finally, note that pandas' corr
method will give you the correlations between the independent variables, but that has nothing to do with a model predicting the disease from the inputs.
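For completeness, this is how you would look at those pairwise correlations on the toy symptom data above; it is purely descriptive and says nothing about predictive performance:

```python
import pandas as pd

# Same symptom columns as above (disease column excluded)
d = {'fever': [0, 0, 0, 1, 0, 0, 1],
     'headache': [0, 0, 1, 0, 0, 1, 0],
     'sorethroat': [1, 0, 0, 0, 1, 1, 0],
     'drowsiness': [0, 1, 0, 1, 1, 0, 0]}
features = pd.DataFrame(d)

# Pairwise Pearson correlations between the symptom columns
corr = features.corr()
print(corr.round(2))
```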