I have the following dataframe:

import pandas as pd
import numpy as np

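# 300 rows: two random integer features, a survey weight, a 4-class target and gender
dt = pd.DataFrame({
    'var1': np.random.randint(1, 200, 300),
    'var2': np.random.randint(1, 200, 300),
    'weight': [1.25] * 250 + [6.25] * 50,   # males get 1.25, females get 6.25
    'target': ['a'] * 20 + ['b'] * 20 + ['c'] * 120 + ['d'] * 140,
    'gender': ['M'] * 250 + ['F'] * 50,
})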

I want to classify the target using var1, var2 and gender.

As you can see the target variable is imbalanced (the size of the classes a, b, c and d varies).

The weight column contains the observation (survey) weights that make the sample representative of the population: in my data set 5/6 of the respondents are male and only 1/6 are female, whereas in the real world the male/female split is roughly 50/50.

My question is: how can I perform a classification with a random forest (RF) while incorporating the weight column as well?

As I understand it, the sample_weight argument in sklearn accounts for the imbalance of the target variable in my data set. What I am interested in is whether the RF can make the splits in its decision trees using something like a "weighted Gini index" to calculate node impurity, instead of the plain Gini index, which weights all observations equally.
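To make "weighted Gini index" concrete, here is a minimal sketch of the quantity I have in mind, assuming each class proportion is computed from summed observation weights rather than from row counts. The function name `weighted_gini` is my own and is not part of any library.

```python
import numpy as np

def weighted_gini(labels, weights):
    """Gini impurity where each observation contributes its weight, not a count of 1."""
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    # Map each label to an integer code, then sum the weights per class.
    _, codes = np.unique(labels, return_inverse=True)
    class_weight = np.bincount(codes, weights=weights)
    p = class_weight / class_weight.sum()
    return 1.0 - np.sum(p ** 2)

# Same target and weight layout as the dataframe above.
target = ['a'] * 20 + ['b'] * 20 + ['c'] * 120 + ['d'] * 140
weight = [1.25] * 250 + [6.25] * 50

unweighted = weighted_gini(target, np.ones(len(target)))  # every row counts equally
weighted = weighted_gini(target, weight)                  # rows count by survey weight
print(unweighted, weighted)
```

With all weights equal to 1 this reduces to the ordinary Gini index, so the two printed values differ only because the 50 female rows carry more weight.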

quant
  • You can use `sample_weight` during `fit()` as mentioned in the question I linked. If still not solved or you think the linked question does not have what you required please edit the question with more information and let me know so that I can reopen it – Vivek Kumar Dec 03 '19 at 17:19
  • @VivekKumar If I understand the documentation correctly, the `sample_weight` refers to imbalances, whereas I am talking about "observation weights". So, I think, that the question in the link refers to something else than my question – quant Dec 04 '19 at 10:00
  • Observation means sample. The linked question shows why and how the `sample_weight` in sklearn can be used. Your question, when saying `"The weights are observation weights so that my sample is representative to the population"`, seems to be talking about exactly that. Please explain your use case in more detail. For example: what does it mean for an observation when its weight is 2 vs. when it's 4? – Vivek Kumar Dec 04 '19 at 10:13
  • I have reopened the question. But your current explanation still adds to the confusion. You say that the classes are imbalanced, but then want to weight the observations on only one feature (gender is a feature)? Or do you want to apply the weight to the whole observation (sample or row), based on gender? If the latter, `sample_weight` is still the right way to do it – Vivek Kumar Dec 04 '19 at 11:06
  • Sounds to me like you want to undersample the larger category (males) or augment the smaller one (females) – Itamar Mushkin Dec 04 '19 at 11:21
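The `sample_weight` suggestion from the comments could be tried roughly as follows. This is only a sketch under stated assumptions: gender is encoded as a 0/1 indicator column that I call `gender_F`, the random seeds and default hyperparameters are arbitrary, and the single `DecisionTreeClassifier` is there purely to inspect the root node. As far as I can tell from the scikit-learn documentation, the tree criteria work on weighted sums when `sample_weight` is passed, so the root impurity should equal the Gini index computed from summed weights.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
dt = pd.DataFrame({
    'var1': rng.integers(1, 200, 300),
    'var2': rng.integers(1, 200, 300),
    'weight': [1.25] * 250 + [6.25] * 50,
    'target': ['a'] * 20 + ['b'] * 20 + ['c'] * 120 + ['d'] * 140,
    'gender': ['M'] * 250 + ['F'] * 50,
})

# Encode gender as a 0/1 indicator (hypothetical column name gender_F).
X = dt[['var1', 'var2']].assign(gender_F=(dt['gender'] == 'F').astype(int))
y = dt['target']
w = dt['weight']

# Random forest fitted with the survey weights as per-row sample weights.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y, sample_weight=w)
pred = rf.predict(X)

# A single tree (no bootstrap) makes the root impurity easy to inspect.
tree = DecisionTreeClassifier(criterion='gini', random_state=0)
tree.fit(X, y, sample_weight=w)
root_impurity = tree.tree_.impurity[0]

# Gini index of the target computed by hand from summed weights per class.
class_weight_totals = w.groupby(y).sum()
p = class_weight_totals / class_weight_totals.sum()
weighted_gini = 1.0 - float((p ** 2).sum())

print(root_impurity, weighted_gini)
```

If the two printed numbers agree, the splits are already being scored with a weighted impurity rather than with plain row counts. Note that inside the forest each tree also sees a bootstrap resample, so the per-tree root impurities there will not match this value exactly.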
