I am looking to train either a random forest or gradient boosting algorithm using sklearn. The data I have is structured so that each data point carries a weight equal to the number of times that data point occurs in the dataset. Is there a way to give sklearn this weight during the training process, or do I need to expand my dataset into a non-weighted version that represents each duplicate data point individually?
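For illustration only (the features and counts here are made up), the weighted data looks something like this, with one count per unique row:

import numpy as np

# Made-up example: one row per unique data point
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.8]])
y = np.array([0, 0, 1])

# weights[i] = number of times row i occurs in the raw data
weights = np.array([3, 1, 7])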
- You could just include that weight or frequency of occurrence as a column in your dataset. – Turtalicious May 07 '19 at 20:05
- Logically I don't think that would work, though. The column would be just one more variable for that data point when predicting, instead of an entire duplicate data point where every variable in that row is repeated. – Stephen Strosko May 07 '19 at 20:12
1 Answer
You can definitely specify the weights while training these classifiers in scikit-learn. Specifically, this happens during the fit step. Here is an example using RandomForestClassifier, but the same also goes for GradientBoostingClassifier:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Load a toy dataset and split it into train and test sets
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Here I define some arbitrary weights just for the sake of the example:
# One integer weight (1 or 2) per training sample, chosen at random
weights = np.random.choice([1, 2], len(y_train))
And then you can fit your model with these weights:
# Pass the weights to fit() through the sample_weight argument
rfc = RandomForestClassifier(n_estimators=20, random_state=42)
rfc.fit(X_train, y_train, sample_weight=weights)
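The same call works for GradientBoostingClassifier mentioned above; a minimal sketch, reusing the same (arbitrary) settings:

from sklearn.ensemble import GradientBoostingClassifier

# GradientBoostingClassifier.fit also accepts sample_weight
gbc = GradientBoostingClassifier(n_estimators=20, random_state=42)
gbc.fit(X_train, y_train, sample_weight=weights)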
You can then evaluate your model on your test data.
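For instance, a quick accuracy check (one of several possible metrics) could look like:

# Mean accuracy on the held-out test set
print(rfc.score(X_test, y_test))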
Now, to your last point: you could in this example resample your training set according to the weights by duplication (as sketched below). But in most real-world cases this ends up being tedious, because
- you would need to make sure all your weights are integers to perform the duplication, and
- you would needlessly multiply the size of your data, which consumes memory and will most likely slow down the training procedure.
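For comparison, a duplication-based version is sketched below (assuming the weights are integers); each training row is repeated weight-many times, so the expanded set grows to the sum of the weights:

# Repeat each row (and its label) according to its integer weight
X_train_dup = np.repeat(X_train, weights, axis=0)
y_train_dup = np.repeat(y_train, weights)

# Fit without sample_weight on the much larger, duplicated training set
rfc_dup = RandomForestClassifier(n_estimators=20, random_state=42)
rfc_dup.fit(X_train_dup, y_train_dup)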

MaximeKan