"ValueError: Found input variables with inconsistent numbers of samples: [40, 10]" Problem with splitting the data

Question

I am using sample data from a Udemy course for practice. There are 51 rows in the data, and I am trying to print the model's score. The error I get is:

ValueError: Found input variables with inconsistent numbers of samples: [40, 10]

I understand that [40, 10] refers to the sizes of the training and test sets, since I set test_size to 0.2.

The code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer as ct
from sklearn.model_selection import train_test_split as tts

data = pd.read_csv("50_Startups.csv")

X = data.drop("Profit",axis = 1)
y = data[["Profit"]]


from sklearn.preprocessing import OneHotEncoder
cat = ["State"]
one_hot = OneHotEncoder()
transformer = ct([("one_hot", one_hot, cat)],remainder="passthrough")
transformed_X = transformer.fit_transform(X)

print(transformed_X)

from sklearn.ensemble import RandomForestRegressor as RFR
model = RFR()
X_train , y_train, X_test , y_test = tts(transformed_X,y,test_size=0.2)
model.fit(X_train,y_train)
print(model.score(X_test,y_test))

I tried changing y to y.values.ravel(), but that did not work either. I understand that this error often comes up with NumPy arrays, but what might have caused it in this code?

Thank you in advance.

score 1 · Accepted Answer · answered May 19 '21 at 17:53

The mistake is in how you unpack the result of train_test_split in this line:

X_train , y_train, X_test, y_test = tts(transformed_X,y,test_size=0.2)

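You may not have noticed, but you swapped the variables y_train and X_test. train_test_split returns X_train, X_test, y_train, y_test in that order, so your y_train actually holds the 10-row test split of X, and model.fit receives 40 samples for X and 10 for y.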

Use this instead:

X_train , X_test,  y_train, y_test = tts(transformed_X,y,test_size=0.2)
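To check the return order yourself, here is a minimal sketch that uses random arrays in place of 50_Startups.csv. The array shapes and the random_state value are assumptions made only for illustration; they are not taken from the course data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (assumption: 50 samples, 5 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

# For each input array, train_test_split returns its train part, then its test part:
# (X_train, X_test, y_train, y_test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (40, 5) (10, 5)
print(y_train.shape, y_test.shape)  # (40,) (10,)

# Unpacking in the order used in the question puts the 10-row X test split
# into the name that is later passed to fit() as the target.
wrong_X_train, wrong_y_train, wrong_X_test, wrong_y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(wrong_X_train), len(wrong_y_train))  # 40 10
```

The last line prints the same two numbers that appear in the error message.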

Read the full documentation at

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
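For completeness, a hedged end-to-end sketch of the pipeline from the question with the unpacking order fixed. The DataFrame here is synthetic: only the "State" and "Profit" column names come from the question, while the "Spend" column, the state names, and all values are hypothetical. Passing y as a Series rather than a one-column DataFrame is my own choice; it should avoid the column-vector warning that fit can raise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for 50_Startups.csv (hypothetical columns and values).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Spend": rng.uniform(0, 150_000, size=50),
    "State": rng.choice(["A", "B", "C"], size=50),
    "Profit": rng.uniform(10_000, 200_000, size=50),
})

X = data.drop("Profit", axis=1)
y = data["Profit"]  # 1-D Series instead of a one-column DataFrame

# One-hot encode the categorical column and pass the rest through unchanged.
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["State"])], remainder="passthrough"
)
transformed_X = transformer.fit_transform(X)

# Correct order: both X splits first, then both y splits.
X_train, X_test, y_train, y_test = train_test_split(
    transformed_X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on the held-out rows
print(score)
```

Because the data is random, the printed score is meaningless; the point is only that fit and score run without the ValueError.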

Prakash Dahal

Ouch! I hadn't noticed. Thanks for helping. – cagatay.e.sahin May 19 '21 at 21:25