I am trying to perform linear regression on the Black Friday dataset. For the model training step, I define X and y, split them with train_test_split, and fit a LinearRegression model.
When I then try to draw a scatter plot, I get the error ValueError: x and y must be the same size.
Note: the dataset is already loaded into the DataFrame df.
# Importing the necessary modules.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Creating the variables X and y.
X = df.drop('Purchase', axis=1).values
y = df['Purchase'].values
# Splitting the dataframe to create a training and testing data set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# This creates a LinearRegression object
lm = LinearRegression()
# Fit a linear model, calculate the root mean squared error and the R2 score.
lm.fit(X_train, y_train)
y_pred_linear = lm.predict(X_test)
y_train_predict = lm.predict(X_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_predict))
r2_train = r2_score(y_train, y_train_predict)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_linear))
r2 = r2_score(y_test, y_pred_linear)
print('Root mean squared error on Training Set', rmse_train)
print('R2 score on Training Set: ', r2_train)
print('Root mean squared error on Test Set', rmse)
print('R2 score on Testing Set: ', r2)
# This is the line that raises the ValueError.
plt.scatter(X_train, y_train, s=10)
When I check X.shape I get (537577, 83), but y.shape gives (537577,).
The plt.scatter call is what raises the ValueError. What I actually want is a scatter plot of the predicted values against the actual values.
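
To make the shape mismatch and the intended plot concrete, here is a minimal sketch on synthetic data rather than the real df. The array sizes, the coefficients, and the names rng, X_demo, y_demo, model and y_hat are assumptions made up for illustration. It shows why a 2-D feature matrix cannot be scattered against a 1-D target, and one possible way to plot predicted against actual values instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df: 1000 rows, 5 features (assumed sizes, not the real data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# X_tr holds 750 * 5 values while y_tr holds only 750, so
# plt.scatter(X_tr, y_tr) fails with "x and y must be the same size".
print(X_tr.size, y_tr.size)

model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Actual and predicted targets are both 1-D with one value per test row,
# so they can be plotted against each other directly.
fig, ax = plt.subplots()
ax.scatter(y_te, y_hat, s=10)

# Reference diagonal: a point on this line is predicted exactly.
lims = [min(y_te.min(), y_hat.min()), max(y_te.max(), y_hat.max())]
ax.plot(lims, lims, color="red", linewidth=1)
ax.set_xlabel("Actual")
ax.set_ylabel("Predicted")
```

With the variables from my code, the equivalent call would presumably be plt.scatter(y_test, y_pred_linear, s=10), followed by plt.show() to display the figure.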