
I am using Scikit-learn to train a classification model. I have both discrete and continuous features in my training data.

I want to do feature selection using mutual information.

Features 1, 2 and 3 are discrete. To this end, I tried the code below:

mutual_info_classif(x, y, discrete_features=[1, 2, 3])

but it did not work; it gives me the error:

 ValueError: could not convert string to float: 'INT'
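
For context, a minimal sketch (with made-up values like those in the comments below) that reproduces the error: mutual_info_classif validates X as a numeric array before it ever looks at discrete_features, so any string-valued column fails the float conversion.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Toy rows resembling the data in the comments: column 0 is numeric,
# columns 1-3 hold strings such as 'tcp', 'http' and 'FIN'.
x = np.array([[0.98, "tcp", "http", "FIN"],
              [8e-06, "udp", "-", "INT"]], dtype=object)
y = np.array([0, 1])

# Raises ValueError: could not convert string to float
mutual_info_classif(x, y, discrete_features=[1, 2, 3])
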
  • I have applied the code that W.P. McNeill proposed in https://stackoverflow.com/q/43643278 but it did not work – samira Nov 25 '18 at 17:42
  • We need more information in order to be able to help you. It might be useful if you copy a simplified example of your code. – silgon Nov 25 '18 at 18:14
  • This is my code: `from sklearn.feature_selection import mutual_info_classif; res_M_train = mutual_info_classif(data_train, Y_train, discrete_features=[1, 2, 3])`. Thank you – samira Nov 25 '18 at 18:23
  • My data is like this: [0.983874,tcp,http,FIN,10,8,816,1172,17.278635,62,252,5976.375,8342.53125,2,2,109.319333,124.932859,5929.211713,192.590406,255,794167371,1624757001,255,0.206572,0.108393,0.098179,82,147,1,184,2,1,1,1,1,2,0,0,1,1,3,0,] As you can see, my first three features are categorical, and I want to calculate the mutual information of each feature: `from sklearn.feature_selection import mutual_info_classif; res_M_train = mutual_info_classif(data_train, Y_train, discrete_features=[1, 2, 3])` – samira Nov 25 '18 at 18:28

3 Answers


A simple example with the mutual information classifier:

import numpy as np
from sklearn.feature_selection import mutual_info_classif
X = np.array([[0, 0, 0],
              [1, 1, 0],
              [2, 0, 1],
              [2, 0, 1],
              [2, 0, 1]])
y = np.array([0, 1, 2, 2, 1])
mutual_info_classif(X, y, discrete_features=True)
# result: array([0.67301167, 0.22314355, 0.39575279])
silgon
  • But I have mixed features, like this: X = np.array([[0, a, 0], [1, b, 0], [2, c, 1], [2, d, 1], [2, a, 1]]) – samira Nov 25 '18 at 18:35
  • This is a row from my data: [8e-06,"udp","-","INT",2,0,1762,0,125000.0003,254,0,881000000.0,0.0,0,0,0.008,0.0,0.0,0.0,0,0,0,0,0.0,0.0,0.0,881,0,0,0,2,2,1,1,1,2,0,0,0,1,2,0] It seems that the first three features cause the problem – samira Nov 26 '18 at 00:12
  • If you're using categories and you have string information, take a look at [`get_dummies`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) – silgon Nov 26 '18 at 09:29

mutual_info_classif can only take numeric data. You need to label-encode the categorical features and then run the same code.

from sklearn.preprocessing import LabelEncoder

x1 = x.apply(LabelEncoder().fit_transform)

Then run the exact same code you were running.

mutual_info_classif(x1, y, discrete_features=[1, 2, 3])
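
Putting it together, a self-contained sketch on toy data shaped like the question's (the values are made up). Note that the one-liner above would also rank-encode the continuous column, so this variation encodes only the string columns:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import mutual_info_classif

# Made-up mixed data: column 0 is continuous, columns 1-3 are categorical.
x = pd.DataFrame([[0.98, "tcp", "http", "FIN"],
                  [0.01, "udp", "-", "INT"],
                  [0.55, "tcp", "http", "FIN"],
                  [0.72, "udp", "dns", "INT"],
                  [0.33, "tcp", "-", "FIN"],
                  [0.64, "udp", "dns", "INT"]])
y = [0, 1, 0, 1, 0, 1]

x1 = x.copy()
for col in [1, 2, 3]:  # encode only the categorical columns
    x1[col] = LabelEncoder().fit_transform(x1[col])

mutual_info_classif(x1, y, discrete_features=[1, 2, 3])
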
Jatin
  • Careful with that, @Jatin; referring to sklearn's docs: `This transformer should be used to encode target values, i.e. y, and not the input X`. So maybe for this case it is a better option to use [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html). – rmoret Dec 22 '21 at 09:03
  • @rmoret Does it matter for calculating mutual information? "Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair (X,Y) is from the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI)." [Mutual Information](https://en.wikipedia.org/wiki/Mutual_information) Since we only care about shared information, ordering should not matter? – DataJanitor Feb 14 '23 at 08:41
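
Following rmoret's suggestion, a sketch of the same idea with OrdinalEncoder, which is designed for 2-D feature matrices (same made-up toy data as above):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import mutual_info_classif

# Made-up mixed data: column 0 is continuous, columns 1-3 are categorical.
x = pd.DataFrame([[0.98, "tcp", "http", "FIN"],
                  [0.01, "udp", "-", "INT"],
                  [0.55, "tcp", "http", "FIN"],
                  [0.72, "udp", "dns", "INT"],
                  [0.33, "tcp", "-", "FIN"],
                  [0.64, "udp", "dns", "INT"]])
y = [0, 1, 0, 1, 0, 1]

# OrdinalEncoder handles all categorical columns in one call.
x[[1, 2, 3]] = OrdinalEncoder().fit_transform(x[[1, 2, 3]])
mutual_info_classif(x, y, discrete_features=[1, 2, 3])

As for the question in the comment: for features flagged as discrete, MI is invariant to how the categories are numbered, so LabelEncoder and OrdinalEncoder give the same estimate here; the docs' warning is about API intent (y vs. X), not the MI value.
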

There is a difference between 'discrete' and 'categorical'. In this case, the function demands that the data be numerical. You can use a label encoder if you have ordinal features; otherwise you would have to one-hot encode the nominal features. You can use pd.get_dummies for this purpose, as in the sketch below.
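
A minimal sketch of the pd.get_dummies route on made-up data (the column names are hypothetical). The dummy columns are discrete while the original float column is not, so a boolean mask is handy:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

x = pd.DataFrame({"dur": [0.98, 0.01, 0.55, 0.72, 0.33, 0.64],
                  "proto": ["tcp", "udp", "tcp", "udp", "tcp", "udp"]})
y = [0, 1, 0, 1, 0, 1]

# One-hot encode the nominal column: proto -> proto_tcp, proto_udp
x1 = pd.get_dummies(x, columns=["proto"])

# Mark only the dummy columns as discrete.
discrete = x1.columns.str.startswith("proto_")
mutual_info_classif(x1, y, discrete_features=discrete)
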

Parul Singh
  • Same here. Does it matter whether you have ordinal features for calculating mutual information? "Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair (X,Y) is from the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI)." [Mutual Information](https://en.wikipedia.org/wiki/Mutual_information) Since we only care about shared information, ordering should not matter? – DataJanitor Feb 14 '23 at 08:44