
I have a dataset of 10K, and I created the following ten features:

  • Distance - (0 or 1)
  • IsPronoun - (True or False)
  • String Match - (True or False)
  • Demonstrative NP - (True if i and j are demonstrative pronouns)
  • Number Agreement - (check whether i or j is a singular or plural pronoun)
  • Semantic compatibility - (whether i and j are semantically compatible)
  • Gender agreement - (check whether i or j is male/female)
  • IsProperNoun - (whether i or j is a proper noun)
  • Appositive - (whether i is an appositive of j)
  • Alias - (whether i is an alias of j or vice versa)

Each of the features has an output from the dataset. Now I want to make the tree. But first, how should I calculate the entropy and information gain?

Zia

1 Answer


You can use mutual_info_classif from sklearn.feature_selection, but you'll need to define your target (dependent) variable first. Assuming all of your attributes are discrete (nominal):

from sklearn.feature_selection import mutual_info_classif

# X_vec: feature matrix (one row per (i, j) pair); Y: target vector
print(mutual_info_classif(X_vec, Y, discrete_features=True))
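
If you would rather compute the numbers yourself, here is a minimal sketch in plain Python. Entropy of the target is H(Y) = -sum(p * log2(p)) over the class proportions p, and the information gain of a feature is H(Y) minus the weighted average entropy of Y within each value of that feature. The function names and the toy values below are made up for illustration, not taken from your dataset:

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of labels minus the weighted entropy after splitting on feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        # Labels of the rows where the feature takes this value
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder


# Toy values, invented for illustration: one entry per (i, j) pair
string_match = [1, 1, 0, 0, 1, 0, 1, 0]
is_pronoun = [1, 0, 1, 0, 1, 0, 1, 0]
target = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = related, 0 = not related

print(information_gain(string_match, target))  # 1.0 (perfect split)
print(information_gain(is_pronoun, target))    # about 0.189
```

The feature with the highest gain would be the first split of the tree. As far as I know, scikit-learn reports mutual information using the natural logarithm, so its values would differ from these by a constant factor while the ranking of features stays the same.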
Roee Anuar
  • The problem is that the dataset is unstructured, and each feature extracts its own output from the dataset. How should I use sklearn to define the values of X and Y and so on? – Zia Jun 03 '20 at 11:48
  • So you'll have to structure it first. Y is always the target (dependent) variable - what are you trying to predict? X is the set of independent variables - you use them to predict the target variable. – Roee Anuar Jun 03 '20 at 11:52
  • Then how should the dataset be structured? We operate on i and j if both have an output for the above-mentioned features. – Zia Jun 03 '20 at 12:26
  • Add i and j as features as well – Roee Anuar Jun 03 '20 at 12:28
  • So it will be a table with 13 columns, where the last column will have the output of yes/no? – Zia Jun 03 '20 at 12:31
  • Yes - but the target variable shouldn't be a part of X - it should be a separate vector – Roee Anuar Jun 03 '20 at 12:36
  • Yes, I use a separate column for the target variable with a value of 0 or 1: 0 means they are not related, otherwise they are related (referred) – Zia Jun 03 '20 at 15:25
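
To make the structure discussed in the comments concrete, here is a rough sketch using pandas and scikit-learn: one row per (i, j) pair, one column per feature, and the 0/1 target kept as a separate vector. The column names and values are hypothetical, and only three of the features are shown to keep it short:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows, one per (i, j) pair; the values are invented
df = pd.DataFrame(
    [
        {"distance": 0, "is_pronoun": 1, "string_match": 1, "related": 1},
        {"distance": 1, "is_pronoun": 0, "string_match": 1, "related": 1},
        {"distance": 0, "is_pronoun": 0, "string_match": 0, "related": 0},
        {"distance": 1, "is_pronoun": 1, "string_match": 0, "related": 0},
        {"distance": 0, "is_pronoun": 1, "string_match": 1, "related": 1},
        {"distance": 1, "is_pronoun": 0, "string_match": 0, "related": 0},
    ]
)

X = df.drop(columns=["related"])  # independent variables only
y = df["related"]                 # target kept as a separate vector

# Mutual information of each discrete feature with the target
scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
for name, score in zip(X.columns, scores):
    print(name, round(score, 3))

# A decision tree that chooses its splits by entropy (information gain)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(tree.predict(X))
```

With real data you would build the rows from your own feature extractors and hold out part of the pairs for testing instead of predicting on the training rows.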