How to calculate the information gain and entropy of a dataset with ten features?

Question

I have a dataset of 10K, and I created the following ten features:

Distance - (0 or 1)
IsPronoun - (True or False)
String Match - (True or False)
Demonstrative NP - (True if i and j are demonstrative pronouns)
Number Agreement - (check if i or j is a singular or plural pronoun)
Semantic compatibility - (if i and j are semantically compatible)
Gender agreement - (check if i or j is male/female)
IsProperNoun - (find whether i or j is a proper noun)
Appositive - (find if i is an appositive of j)
Alias - (find if i is an alias of j or vice versa)

Each of the features has an output from the dataset. Now I want to make the tree. But first, how should I calculate the entropy and information gain?

score 0 · Answer 1 · answered Jun 03 '20 at 11:41

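The information gain of a feature is the mutual information between that feature and the target, so you can use mutual_info_classif from sklearn.feature_selection, but you'll need to define your target (dependent) variable. Assuming all of your attributes are discrete (nominal):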

from sklearn.feature_selection import mutual_info_classif

# X_vec: feature matrix (one row per i/j pair); Y: separate target vector
print(mutual_info_classif(X_vec, Y, discrete_features=True))
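
For discrete features, entropy and information gain can also be worked out by hand, which may help when deciding tree splits. The following is a minimal sketch, not the answerer's method: the function names `entropy` and `information_gain` and the toy values are made up for illustration, and the result is in bits (log base 2). scikit-learn's documentation describes mutual information in nat units, so its numbers would be expected to differ from these by a factor of ln 2.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, target):
    """Entropy of target minus the weighted entropy of target per feature value."""
    total = len(target)
    remainder = 0.0
    for value in set(feature):
        # Target labels of the rows where the feature takes this value.
        subset = [t for f, t in zip(feature, target) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(target) - remainder


# Toy data (hypothetical): one binary feature and a binary target, 8 pairs.
string_match = [1, 1, 1, 0, 0, 0, 0, 1]
coreferent = [1, 1, 1, 0, 0, 0, 1, 0]

print(entropy(coreferent))                         # entropy of the target
print(information_gain(string_match, coreferent))  # gain from splitting on the feature
```

Running `information_gain` once per feature column and picking the largest value is the usual way to choose the first split of an ID3-style tree.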

Roee Anuar

The problem is that the dataset is unstructured, and each feature extracts its own output from the dataset. How should I use sklearn to define the values of X and Y and so on? – Zia Jun 03 '20 at 11:48
So you'll have to structure it first. Y is always the target (dependent) variable - what are you trying to predict? X is the set of independent variables - you use them to predict the target variable. – Roee Anuar Jun 03 '20 at 11:52
Then how should the dataset be structured? We operate on i and j, and both have an output for the above-mentioned features. – Zia Jun 03 '20 at 12:26
Add i and j as features as well – Roee Anuar Jun 03 '20 at 12:28
So it will be a table with 13 columns, where the last column will have the output of yes/no? – Zia Jun 03 '20 at 12:31
Yes, but the target variable shouldn't be a part of X; it should be a separate vector. – Roee Anuar Jun 03 '20 at 12:36
Yes, I use a separate column for the target variable, with a value of 0 or 1: 0 means i and j are not related, otherwise they are related (one refers to the other). – Zia Jun 03 '20 at 15:25
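
The structure discussed in these comments could be sketched as follows: one row per candidate pair (i, j), ten feature columns forming X, and the 0/1 target kept as a separate vector y. All names and values here (`FEATURES`, `rows`, the example mentions) are hypothetical. The sketch keeps i and j as row identifiers only; encoding them as extra feature columns, as suggested above, would need a numeric encoding and is left open.

```python
# Column order for the ten features (hypothetical names based on the question).
FEATURES = [
    "distance", "is_pronoun", "string_match", "demonstrative_np",
    "number_agreement", "semantic_compatibility", "gender_agreement",
    "is_proper_noun", "appositive", "alias",
]

# Each entry: (i, j, ten feature values in FEATURES order, target).
# Values are made up purely to show the layout.
rows = [
    ("Mary", "she", (0, 1, 0, 0, 1, 1, 1, 0, 0, 0), 1),
    ("Mary", "it", (0, 1, 0, 0, 1, 0, 0, 0, 0, 0), 0),
    ("IBM", "International Business Machines", (1, 0, 0, 0, 1, 1, 1, 1, 0, 1), 1),
]

# Identifiers of each pair, kept outside the feature matrix.
pair_ids = [(i, j) for i, j, _, _ in rows]

# X: one list of ten integer feature values per pair.
X = [[int(v) for v in feats] for _, _, feats, _ in rows]

# y: separate target vector (1 = related, 0 = not related).
y = [target for *_, target in rows]

print(len(X), "pairs,", len(X[0]), "features each")
print("targets:", y)
```

With the data in this shape, `X` and `y` play the roles of `X_vec` and `Y` in the answer's call.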
