
I have presented a small sample of the dataset that I am working on. My original dataset has around 400 columns for 'Symptoms' and 1 column for 'Disease'. The expected output is the top 'N' (maybe 10, or some other number) 'Symptoms' that are most significant for a particular disease. My sample dataset is as follows:

fever    headache    sore throat    drowsiness    Disease
  0         0             1             0         Fungal infection
  0         0             0             1         Fungal infection
  0         1             0             0         liver infection
  1         0             0             1         diarrhoea
  0         0             1             1         common cold
  0         1             1             0         diarrhoea
  1         0             0             0         flu
    

I have tried using sklearn's SelectKBest but cannot make sense of the results. I would also like to know whether pandas' DataFrame.corr function can work in this case.

Lalit
  • If I got you clearly, you want to sum up values in each column per disease and identify the column with the highest n items for each row? That is the n leading symptoms per disease? – wwnde Jul 06 '20 at 00:05
  • @wwnde I didn't mention this point clearly, my bad. I am not looking for a sum. What I am looking for is like a correlation between each symptom. Like for example, if I input headache, I want to find out which other symptoms have got the greatest probability to occur with headache. So, given a symptom of headache, give me the top 'N' symptoms that can occur with headache. Hope that makes it clear or please let me know. – Lalit Jul 06 '20 at 03:21
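The co-occurrence view described in the comment above can be sketched as a conditional-frequency computation over the sample data (the `co_occurring` helper name is hypothetical, not from any library):

```python
import pandas as pd

# Binary symptom data from the sample in the question (one row per record).
df = pd.DataFrame({
    'fever':      [0, 0, 0, 1, 0, 0, 1],
    'headache':   [0, 0, 1, 0, 0, 1, 0],
    'sorethroat': [1, 0, 0, 0, 1, 1, 0],
    'drowsiness': [0, 1, 0, 1, 1, 0, 0],
})

def co_occurring(df, symptom, n=3):
    """Top-n other symptoms by frequency among rows where `symptom` is present."""
    present = df[df[symptom] == 1]
    freqs = present.drop(columns=symptom).mean()
    return freqs.sort_values(ascending=False).head(n)

print(co_occurring(df, 'headache'))
# sorethroat appears in half of the headache rows; fever and drowsiness in none
```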

1 Answer


One way to address this problem is to use a naive Bayes classifier with feature probabilities modelled as Bernoulli distributions. This assumes the input variables are not categorical, as you mention in the question, but simply binary. I think that is the more reasonable assumption, and it seems to follow from the construction of your input data, where the symptom columns appear to be 0/1.

A first pass at a model could be the following (adapting the important_features function from this answer):

import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

def important_features(classifier, feature_names, n=20):
    """Print the n features with the highest per-class log-probability."""
    class_labels = classifier.classes_

    # Iterate over the classes, not the features: feature_log_prob_
    # has one row per class.
    for i, class_label in enumerate(class_labels):
        print("Important features in ", class_label)
        topn_class = sorted(zip(classifier.feature_log_prob_[i], feature_names),
                            reverse=True)[:n]

        for coef, feat in topn_class:
            print(coef, feat)
        print('-----------------------')

d = {}
d['fever'] = np.array([0,0,0,1,0,0,1])
d['headache'] = np.array([0,0,1,0,0,1,0])
d['sorethroat'] = np.array([1,0,0,0,1,1,0])
d['drowsiness'] = np.array([0,1,0,1,1,0,0])
d['disease'] = ['Fungal infection','Fungal infection','liver infection',
                'diarrhoea','common cold','diarrhoea','flu']

df = pd.DataFrame(d)

X = df[df.columns[:-1]]
y = df['disease']

clf = BernoulliNB()
clf.fit(X, y)

important_features(clf, df.columns[:-1])

This should give you the following output, which of course is just for demonstration purposes, as I only used the sample data you provided above:

Important features in  Fungal infection
-0.6931471805599453 sorethroat
-0.6931471805599453 drowsiness
-1.3862943611198906 headache
-1.3862943611198906 fever
-----------------------
Important features in  common cold
-0.4054651081081645 sorethroat
-0.4054651081081645 drowsiness
-1.0986122886681098 headache
-1.0986122886681098 fever
-----------------------
Important features in  diarrhoea
-0.6931471805599453 sorethroat
-0.6931471805599453 headache
-0.6931471805599453 fever
-0.6931471805599453 drowsiness
-----------------------
Important features in  flu
-0.4054651081081645 fever
-1.0986122886681098 sorethroat
-1.0986122886681098 headache
-1.0986122886681098 drowsiness
-----------------------
Important features in  liver infection
-0.4054651081081645 headache
-1.0986122886681098 sorethroat
-1.0986122886681098 fever
-1.0986122886681098 drowsiness
-----------------------

Naive Bayes of course doesn't account for correlation between the independent variables: for example, one could be more likely to have a headache given a fever anyway, independently of the underlying disease. If this limitation is not an issue for you, then you can go ahead and run the model on all your data. Note that it would probably be very difficult to train a more general model that estimates all the possible correlations from the data.

Finally, note that pandas' DataFrame.corr method will give you the correlations between the independent variables (i.e. symptom co-occurrence), but it won't have anything to do with a model predicting the disease from the inputs.
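For completeness, a minimal sketch of that corr computation on the sample symptom columns (Pearson correlation on 0/1 columns reduces to the phi coefficient):

```python
import pandas as pd

# Binary symptom columns from the sample data (disease column excluded).
df = pd.DataFrame({
    'fever':      [0, 0, 0, 1, 0, 0, 1],
    'headache':   [0, 0, 1, 0, 0, 1, 0],
    'sorethroat': [1, 0, 0, 0, 1, 1, 0],
    'drowsiness': [0, 1, 0, 1, 1, 0, 0],
})

corr = df.corr()  # pairwise symptom-symptom correlation matrix

# Strongest correlates of one symptom, excluding the symptom itself:
print(corr['headache'].drop('headache').sort_values(ascending=False))
```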

LeoC
  • thanks for the answer. I have a huge dataset and I have just provided a sample of it. There are around 400 symptoms. Is there any optimized way to convert those to a numpy array like you have shown in example 'd['fever'] = np.array([0,0,0,1,0,0,1])'. Otherwise manually doing this seems really difficult. – Lalit Jul 06 '20 at 19:03
  • You shouldn't do it manually. How do you read the data set to begin with? If you read it like a dataframe then you can go straight to the step that begins with `X = df[df.columns[:-1]]` – LeoC Jul 06 '20 at 19:44
  • Thanks. Tried it. What's happening is it's more of a count based important features. So if I am taking top 10, 1st value is unique and rest all are same. This is happening all throughout the dataset mostly. – Lalit Jul 06 '20 at 20:48
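Following up on the comment thread about avoiding the manual arrays: assuming the full 400-column dataset is a CSV with the disease in the last column (the filename `symptoms.csv` is hypothetical; an in-memory string stands in for the file here), the loading and fitting steps could look like this:

```python
import io

import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# In practice: df = pd.read_csv('symptoms.csv')  # filename hypothetical
csv_text = """fever,headache,sorethroat,drowsiness,disease
0,0,1,0,Fungal infection
0,0,0,1,Fungal infection
0,1,0,0,liver infection
1,0,0,1,diarrhoea
0,0,1,1,common cold
0,1,1,0,diarrhoea
1,0,0,0,flu
"""
df = pd.read_csv(io.StringIO(csv_text))

X = df.drop(columns='disease')  # all symptom columns at once, no manual arrays
y = df['disease']

clf = BernoulliNB().fit(X, y)

# Top-n symptoms per disease; np.exp turns the log-probabilities into
# per-class conditional probabilities, which are easier to read.
for label, log_probs in zip(clf.classes_, clf.feature_log_prob_):
    top = sorted(zip(np.exp(log_probs), X.columns), reverse=True)[:2]
    print(label, top)
```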