
I am trying to perform linear regression on the Black Friday dataset. For the model training step, I defined the X and y values and then performed the train/test split.

I then trained my model using linear regression. After that, I tried to draw a scatter plot, but I get the error ValueError: x and y must be the same size.

Note: I already imported the dataset 'df'.

# Importing the necessary modules.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Creating the variables X and y.

X = df.drop('Purchase', axis=1).values
y = df['Purchase'].values


# Splitting the dataframe to create a training and testing data set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# This creates a LinearRegression object
lm = LinearRegression()

# Fit a linear model, calculate the root mean squared error and the R2 score.
lm.fit(X_train, y_train)

y_pred_linear = lm.predict(X_test)
y_train_predict = lm.predict(X_train)

rmse_train = np.sqrt(mean_squared_error(y_train, y_train_predict))
r2_train = r2_score(y_train, y_train_predict)

rmse = np.sqrt(mean_squared_error(y_test, y_pred_linear))
r2 = r2_score(y_test, y_pred_linear)

print('Root mean squared error on Training Set', rmse_train)
print('R2 score on Training Set: ', r2_train)

print('Root mean squared error on Test Set', rmse)
print('R2 score on Testing Set: ', r2)

plt.scatter(X_train, y_train, s=10)

When I check X.shape, I get (537577, 83), but y.shape gives (537577,).

The scatter plot call is what raises the ValueError. Basically, I want a scatter plot of the predicted values versus the actual values.

1 Answer


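The error occurs because plt.scatter needs x and y to contain the same number of values, but each row of X_train holds 83 values while y_train holds one value per row. The plot you are aiming for might not be very useful, since it crams 83 different variables onto a single x-axis. That said, if that is what you want, the following should do the trick.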

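import numpy as np
import matplotlib.pyplot as plt

# Plot a random sample of rows rather than all of them, to keep plotting fast.
number_of_data_to_plot = 500
random_sample = np.random.randint(0, X_train.shape[0], number_of_data_to_plot)

# One scatter series per feature column, all against the same target values.
for i in range(X_train.shape[1]):
    plt.scatter(X_train[random_sample, i], y_train[random_sample])
plt.show()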
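
Since the stated goal is a plot of predicted versus actual values, a more direct option may be to plot y_test against y_pred_linear, which are both one-dimensional and the same length. The following is only a minimal sketch: it assumes a small synthetic dataset (1000 rows, 5 features, made-up coefficients) in place of df, and reuses the variable names from the question.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: 1000 rows, 5 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
true_coefs = np.array([3.0, -2.0, 0.5, 1.0, 4.0])  # made-up coefficients
y = X @ true_coefs + rng.normal(scale=0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred_linear = lm.predict(X_test)

# y_test and y_pred_linear each hold one value per test row,
# so scatter receives two arrays of the same size.
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred_linear, s=10)

# Reference line: points on it are predicted exactly.
low = min(y_test.min(), y_pred_linear.min())
high = max(y_test.max(), y_pred_linear.max())
ax.plot([low, high], [low, high], color="red")

ax.set_xlabel("Actual Purchase")
ax.set_ylabel("Predicted Purchase")
plt.show()
```

With the real data, only the last plotting section should be needed, using the y_test and y_pred_linear already computed in the question. The closer the points lie to the red line, the better the predictions match the actual values.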
Biranjan
  • Thank you for your reply. I am trying that code in my notebook, but it's taking forever to give me a result. Any advice on how to rewrite my X and y variables? – Sheema Murugesh Babu Jul 26 '19 at 19:06
  • Yes, I forgot to warn you about the time it takes, so I edited my answer accordingly. You can randomly sample a small number of rows to plot instead of all the data. You can reduce that number below 500 to make it even faster. – Biranjan Jul 26 '19 at 19:53