I have the following dataframe
import pandas as pd
import numpy as np

dt = pd.DataFrame({'var1': np.random.randint(1, 200, 300),
                   'var2': np.random.randint(1, 200, 300),
                   'weight': [1.25] * 250 + [6.25] * 50,
                   'target': ['a'] * 20 + ['b'] * 20 + ['c'] * 120 + ['d'] * 140,
                   'gender': ['M'] * 250 + ['F'] * 50})
I want to perform a classification on target, using var1, var2 and gender. As you can see, the target variable is imbalanced (the sizes of the classes a, b, c and d vary).
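For clarity, this is roughly how I build the feature matrix (assuming that one-hot encoding gender with pd.get_dummies is an acceptable way to include it next to var1 and var2):

# One-hot encode gender so it can be used together with the numeric predictors
X = pd.get_dummies(dt[['var1', 'var2', 'gender']], columns=['gender'], drop_first=True)
y = dt['target']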
The weight column contains the observation (survey) weights that make the sample representative of the population (in my data set 5/6 are Males and only 1/6 are Females, whereas in the real world the Males/Females proportion is around 50/50).
My question is: how can I perform a classification using RF while also incorporating the weight column? The sample_weight argument in the sklearn package would take into consideration the imbalance of the target variable in my dataset, but what I am interested in is whether there is a way for the RF to make the splits in the decision trees using something like a "weighted Gini index" to calculate the impurity of a node, instead of the plain Gini index, which "weights" all observations equally.
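For reference, here is a minimal sketch of what I have in mind, using X and y from above: the survey weights are passed through the sample_weight argument of fit(), and the target imbalance is handled with class_weight (the hyper-parameter values below are only placeholders). What I am not sure about is whether sample_weight actually enters the impurity calculation at every split, or only reweights the bootstrap samples and the class frequencies:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Keep the survey weights aligned with the rows during the split
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, dt['weight'], test_size=0.3, random_state=42, stratify=y)

# class_weight='balanced' is meant for the imbalance of target,
# sample_weight passes the survey weights into the fitting procedure
rf = RandomForestClassifier(n_estimators=500, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train, sample_weight=w_train)
print(rf.score(X_test, y_test))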