I am working with imbalanced data and trying to improve my model by stratifying the train/validation split, but I am unsure how to do this correctly. Nothing I have tried so far changes anything.

It should be something like this:

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, shuffle = True, random_state = 0, stratify = y_train)

but it doesn't matter whether I pass the "stratify" parameter or not. My labels are one-hot encoded, and y_train looks like this: [[1. 0.] [1. 0.] [0. 1.] ... [0. 1.] [0. 1.] [1. 0.]]

As far as I understand, stratify needs my two classes, but I am unsure how to provide them.

EDIT: It doesn't matter whether I set stratify = y_train or not, because the dimensions of y_train don't change.

Thanks!

Kirk1746

1 Answer

Stratify doesn't change the dimensions of the split; it controls the proportion of the classes in each part. For instance, if class 1 makes up 20% of the original data and class 0 makes up 80%, then the training and validation sets should each keep that same 20/80 proportion.
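As a rough illustration, here is a minimal sketch on made-up data (the array `X`, the 800/200 class counts, and the helper `class1_share` are all hypothetical, not taken from the question). It splits the same one-hot labels with and without `stratify` and prints the sizes and the share of class 1 in each part. The sizes depend only on `test_size`, so they come out identical in both cases; what can differ is the class share. The one-hot rows are collapsed to 1-D class ids with `np.argmax` before being passed to `stratify`, which is just one straightforward way to hand over "the two classes".

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 800 samples of class 0, 200 of class 1.
labels = np.array([0] * 800 + [1] * 200)
rng.shuffle(labels)
X = rng.normal(size=(1000, 5))
y_onehot = np.eye(2)[labels]  # rows look like [1. 0.] or [0. 1.]

# Collapse the one-hot rows to 1-D class ids to use as the stratification key.
class_ids = np.argmax(y_onehot, axis=1)

# Same split settings, once without and once with stratification.
plain = train_test_split(
    X, y_onehot, test_size=0.2, shuffle=True, random_state=0
)
strat = train_test_split(
    X, y_onehot, test_size=0.2, shuffle=True, random_state=0,
    stratify=class_ids,
)


def class1_share(y):
    """Fraction of rows whose one-hot label is class 1."""
    return float(y[:, 1].mean())


for name, (X_tr, X_va, y_tr, y_va) in {"plain": plain, "stratified": strat}.items():
    print(
        name,
        "train size:", len(y_tr),
        "val size:", len(y_va),
        "class-1 share train:", round(class1_share(y_tr), 3),
        "class-1 share val:", round(class1_share(y_va), 3),
    )
```

If the goal is to actually change the number of training samples or the class ratio, that would be resampling (for example undersampling the majority class), which is a separate step from a stratified split.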

I hope this helps.

sharmajee499
  • Thanks, I know, but my problem is that my data is unbalanced and stratify does not seem to do anything. The train/test data remains the same. – Kirk1746 Jul 23 '20 at 13:30
  • To clarify what my problem is: It doesn't make a difference if I set "stratify = y_train" or leave it out because my data remains: "Train on 191549 samples, validate on 47888 samples" even though the training samples should be decreasing. I just don't understand why it isn't working. – Kirk1746 Jul 23 '20 at 14:28