I have a highly imbalanced data set from which I want to get both binary classifications and class probabilities. I have managed to obtain results from cross_val_predict with logistic regression and random forest by using class weights.
I am aware that RandomForestClassifier and LogisticRegression accept class_weight as an argument, while KNeighborsRegressor and GaussianNB do not. However, the documentation for KNN and NB says that I can use fit, which accepts sample weights:
fit(self, X, y, sample_weight=None)
So I was thinking of working around it by calculating class weights and using them to build an array of sample weights, where each sample gets the weight of its class. Here is the code for that:
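import numpy as np
from sklearn.utils import class_weight

# Balanced class weights, ordered like np.unique(y) (False first, then True)
c_w = class_weight.compute_class_weight('balanced', classes=np.unique(y), y=y)
# Give each sample the weight of its class
sw = []
for i in range(len(y)):
    if y[i] == False:
        sw.append(c_w[0])
    else:
        sw.append(c_w[1])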
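As far as I can tell, the loop above could also be written with compute_sample_weight from sklearn.utils.class_weight, which maps balanced class weights onto the samples directly. The following is only a sketch: the label array y here is made up for illustration and is not my real data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Hypothetical imbalanced boolean labels, for illustration only
y = np.array([False] * 8 + [True] * 2)

# Loop-style result: look up the class weight of each sample
c_w = compute_class_weight('balanced', classes=np.unique(y), y=y)
sw_loop = np.where(y, c_w[1], c_w[0])

# Same weights computed in one call
sw = compute_sample_weight('balanced', y)
```

Both arrays should hold one weight per sample, with the minority class receiving the larger weight.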
I am not sure whether this workaround makes sense, but I managed to fit the model this way and I seem to get better results for my smaller class.
The issue now is that I want to use this method with sklearn's cross_val_predict(), but I cannot find a way to pass the sample weights through cross-validation.
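To make the goal concrete, this is roughly the behaviour I am after, written as a manual loop instead of cross_val_predict. It is a minimal sketch under assumptions: the data comes from make_classification rather than my real set, the choice of StratifiedKFold with 5 splits is arbitrary, and the weights are recomputed on each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (about 90% / 10%), for illustration only
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = np.zeros((len(y), 2))  # out-of-fold class probabilities

for train_idx, test_idx in skf.split(X, y):
    # Sample weights are derived from the training fold only
    sw_train = compute_sample_weight('balanced', y[train_idx])
    model = GaussianNB()
    model.fit(X[train_idx], y[train_idx], sample_weight=sw_train)
    proba[test_idx] = model.predict_proba(X[test_idx])

# Labels are 0 and 1 here, so the column index is the predicted class
pred = proba.argmax(axis=1)
```

This gives me out-of-fold predictions and probabilities for every sample, which is what I would expect cross_val_predict to return if it could forward the weights to fit.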
I have 2 questions:
- Does my workaround to use sample weights to substitute class weights make sense?
- Is there a way to pass sample weights through cross_val_predict just like you would when you use fit without cross validation?