How to fix 'ValueError: Found input variables with inconsistent numbers of samples: [32979, 21602]'?

Question

I am building a Logistic Regression model for sentiment analysis. When I try to split my dataset into x and y train and validation sets, I get this error: ValueError: Found input variables with inconsistent numbers of samples: [32979, 21602]

# splitting data into training and validation set 
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)
lreg = LogisticRegression() # training the model 
lreg.fit(xtrain_bow, ytrain) 
prediction = lreg.predict_proba(xvalid_bow) # predicting on the validation set 
prediction_int = prediction[:,1] >= 0.3 # if prediction is greater than or equal to 0.3 then 1 else 0 
prediction_int = prediction_int.astype(int) # np.int was removed in NumPy 1.24; the builtin int is equivalent
f1_score(yvalid, prediction_int) # calculating f1 score for the validation set

I saw in some posts that this can happen because of the shapes of X and y, so I printed out the shapes of the dataset. I have split my dataset into 85% for training and the rest for test/validation purposes.

# Extracting train and test BoW features
split_frac = 0.85

split_num = int(len(combi['tidy_tweet']) * split_frac)

train_bow = bow[:split_num,:] 
test_bow = bow[split_num:,:] 
print(train_bow.shape)
print(test_bow.shape)
print(train['label'].shape)

(32979, 1000)
(5820, 1000)
(21602,)

The error is raised on this line:

----> 1 xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)
      2 lreg = LogisticRegression() # training the model
      3 lreg.fit(xtrain_bow, ytrain)

Now I am clueless about what is actually causing the problem. Can you please help? Thanks in advance.

score 0 · Answer 1 · answered Jul 06 '19 at 12:44


If you comment out the f1_score line and try again, it should not give you that error. Let me know if it works. Thanks.


benai


Thanks, but the problem is in the line of train_test_split. Can you please help me regarding this? – Deb Prakash Chatterjee Jul 06 '19 at 13:12

Anubhav Singh · Accepted Answer · 2019-07-06T17:11:04.413

0

You are getting the above error because the second argument to train_test_split(), i.e. the labels, has length 21602 while the first argument has length 32979. The X and y data must have the same length, so check the lengths of train_bow and train['label'].

So, just change

xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], test_size=0.3, random_state=42)

to something like below:

xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(bow[:split_num,:-1], bow[:split_num,-1], test_size=0.3, random_state=42)

(Assuming bow contains both features and labels, labels being the last column).

For more, see the documentation for sklearn.model_selection.train_test_split.
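
The length mismatch described in this answer can be reproduced on toy arrays. The following is only a minimal sketch: `X` and `y` are hypothetical stand-ins for `train_bow` and `train['label']`, and the sizes 10 and 7 are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: 10 feature rows but only 7 labels.
X = np.zeros((10, 3))   # plays the role of train_bow
y = np.zeros(7)         # plays the role of train['label']

# Passing arrays of different lengths raises the ValueError from the question.
try:
    train_test_split(X, y, test_size=0.3, random_state=42)
    message = None
except ValueError as err:
    message = str(err)

print(message)

# Once X is cut down to the rows that actually have a label, the split works.
X_aligned = X[:len(y)]
X_tr, X_val, y_tr, y_val = train_test_split(
    X_aligned, y, test_size=0.3, random_state=42
)
print(X_tr.shape, X_val.shape, y_tr.shape, y_val.shape)
```

Slicing is only valid if the first `len(y)` rows of `X` really are the rows the labels belong to, so it is worth checking how the feature matrix was assembled before relying on it.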

edited Jul 06 '19 at 17:11

answered Jul 06 '19 at 17:03

Anubhav Singh


Thanks, but bow does not contain labels; bow only contains the tidy preprocessed tweets: `bow = bow_vectorizer.fit_transform(combi['tidy_tweet'])`. The labels are taken from the train variable, which is the actual csv file. So, do I need to change my `y` from `bow[:split_num,-1]` to `train['label']` or anything else? – Deb Prakash Chatterjee Jul 06 '19 at 19:00
Okay. But, you get the idea what I want to say? – Anubhav Singh Jul 06 '19 at 19:02
You have to pass label of equal length as the first parameter, i.e., X. – Anubhav Singh Jul 06 '19 at 19:04
Above example is just an example. – Anubhav Singh Jul 06 '19 at 19:04
Okay, means labels of size (32979, ) instead of (21602,)? Is it? – Deb Prakash Chatterjee Jul 06 '19 at 19:10
Thanks for your help, it is really informative, but one last question, should I reshape my train['label'] to (32979, ) to make it (32979, )? – Deb Prakash Chatterjee Jul 06 '19 at 19:13
(32979, ) and (32979, ) are same. Just passed it. – Anubhav Singh Jul 06 '19 at 19:13
no no, I am asking train['label'] is (21602, ), so I should reshape it, right? to make it (32979, ), right? – Deb Prakash Chatterjee Jul 06 '19 at 19:15
then? how can I change the shape? – Deb Prakash Chatterjee Jul 06 '19 at 19:17
Or you can do one thing just change `train_bow = bow[:split_num,:]` to `train_bow = bow[:len(train['label']),:]` – Anubhav Singh Jul 06 '19 at 19:19
Well, can you please elaborate, I am pretty new in this field, especially about sci-kit learn. in the second part, `train_bow` or `xvalid_bow`? – Deb Prakash Chatterjee Jul 06 '19 at 19:20
Yeah, X and Y should be of the same length. `len(X)>len(Y)` in your case. I don't know what your label is comprised of. But if it's randomly generated data, then make your X data the same length as your Y data. – Anubhav Singh Jul 06 '19 at 19:23
Thanks, a lot, I am trying this, if I get stuck somewhere, I will comment down. Thanks again. – Deb Prakash Chatterjee Jul 06 '19 at 19:25
Yeah Sure. But try what I said above. First check if your label data is correct. Secondly, if it is some random generated data, then change `train_bow = bow[:split_num,:]` to `train_bow = bow[:len(train['label']),:]` inside `train_test_split()`. – Anubhav Singh Jul 06 '19 at 19:27
Bro, I am done, succeeded, but the problem is now I have to shrink the train data to 65% instead of 80%. Is there a way to solve this? less training data means less accuracy, right? – Deb Prakash Chatterjee Jul 06 '19 at 19:34
Yeah. But you can't just add random values to the label data. It should have some meaning. What you want to achieve. That's why, I told you to check if your label data is correct. – Anubhav Singh Jul 06 '19 at 19:35
yeah, I have checked my label data, it aligned according to each tweet. I know we can't just add random data in the label, but there is no way out then to make my train set 80% and do it? – Deb Prakash Chatterjee Jul 06 '19 at 19:41
Well, Thanks though, you made my day. – Deb Prakash Chatterjee Jul 06 '19 at 19:51
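
The fix suggested in the comments, slicing `bow` by the number of labels instead of by a fixed fraction, might look like the sketch below. It assumes the labelled rows come first in `combi` and any unlabelled rows follow them; the thread does not confirm this. The small `labels` and `bow` arrays are hypothetical stand-ins, not the asker's data.

```python
import numpy as np

# Hypothetical stand-ins: 6 labelled rows followed by 4 unlabelled rows.
labels = np.array([0, 1, 0, 0, 1, 0])      # plays the role of train['label']
bow = np.arange(10 * 3).reshape(10, 3)     # plays the role of the BoW matrix

# Cut at the number of labels rather than at a fixed fraction of all rows.
n_labeled = len(labels)
train_bow = bow[:n_labeled, :]   # rows that have a label
test_bow = bow[n_labeled:, :]    # remaining rows without a label

print(train_bow.shape, test_bow.shape)  # (6, 3) (4, 3)

# The same arithmetic applied to the shapes printed in the question.
total_rows = 32979 + 5820             # rows in bow
unlabeled_rows = total_rows - 21602   # rows with no entry in train['label']
print(total_rows, unlabeled_rows)     # 38799 17197
```

Under that assumption, the 85% cut took in rows that never had a label, so slicing at `len(train['label'])` does not discard any labelled training data. The labelled set was 21602 rows all along.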
