I am looking to train either a random forest or gradient boosting algorithm using sklearn. The data I have is structured so that each data point carries a weight equal to the number of times that data point occurs in the dataset. Is there a way to give sklearn this weight during the training process, or do I need to expand my dataset into a non-weighted version that represents each duplicate data point individually?
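For illustration only (the features and counts here are made up), the weighted data looks something like this, with one count per unique row:

import numpy as np

# Made-up example: one row per unique data point
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.8]])
y = np.array([0, 0, 1])

# weights[i] = number of times row i occurs in the raw data
weights = np.array([3, 1, 7])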
- You could just include that weight or frequency of occurrence as a column in your dataset. – Turtalicious May 07 '19 at 20:05
- Logically I don't think that would work, though. The column would be just one more variable for that data point when predicting, instead of an entire duplicate data point where every variable in that row is repeated. – Stephen Strosko May 07 '19 at 20:12
1 Answer
You can definitely specify the weights while training these classifiers in scikit-learn. Specifically, this happens during the fit step. Here is an example using RandomForestClassifier, but the same also goes for GradientBoostingClassifier:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Load a toy dataset and split it into train and test sets
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
Here I define some arbitrary weights just for the sake of the example:
# One integer weight (1 or 2) per training sample, chosen at random
weights = np.random.choice([1, 2], len(y_train))
And then you can fit your model with these weights:
# Pass the weights to fit() through the sample_weight argument
rfc = RandomForestClassifier(n_estimators=20, random_state=42)
rfc.fit(X_train, y_train, sample_weight=weights)
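The same call works for GradientBoostingClassifier mentioned above; a minimal sketch, reusing the same (arbitrary) settings:

from sklearn.ensemble import GradientBoostingClassifier

# GradientBoostingClassifier.fit also accepts sample_weight
gbc = GradientBoostingClassifier(n_estimators=20, random_state=42)
gbc.fit(X_train, y_train, sample_weight=weights)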
You can then evaluate your model on your test data.
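For instance, a quick accuracy check (one of several possible metrics) could look like:

# Mean accuracy on the held-out test set
print(rfc.score(X_test, y_test))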
Now, to your last point: you could in this example resample your training set according to the weights by duplication (as sketched below). But in most real-world cases this ends up being tedious, because
- you would need to make sure all your weights are integers to perform the duplication, and
- you would needlessly multiply the size of your data, which consumes memory and will most likely slow down the training procedure.
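For comparison, a duplication-based version is sketched below (assuming the weights are integers); each training row is repeated weight-many times, so the expanded set grows to the sum of the weights:

# Repeat each row (and its label) according to its integer weight
X_train_dup = np.repeat(X_train, weights, axis=0)
y_train_dup = np.repeat(y_train, weights)

# Fit without sample_weight on the much larger, duplicated training set
rfc_dup = RandomForestClassifier(n_estimators=20, random_state=42)
rfc_dup.fit(X_train_dup, y_train_dup)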

MaximeKan