
I am running a Lasso regression on crude oil price data, and I must not shuffle the data when I split it into training and test sets.

The crude oil price in 2020 behaves very strangely because of COVID-19.

I want to know how to fix the error on the training and test sets. I need to do this without shuffling.

# Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#%matplotlib inline 
plt.style.use('ggplot') 
import warnings; warnings.simplefilter('ignore')

# Read data from CSV to Pandas 
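# Note: this URL is the Kaggle dataset page; pd.read_csv needs a path or URL that serves the raw DB_2.csv file
df = pd.read_csv('https://www.kaggle.com/yothinpukongnin/crude-oil-price?select=DB_2.csv', index_col=0)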
#df = df.iloc[ 0:108 , : ]
X = df.drop(['Dubai','EU_RUB'], axis=1)
y = df['Dubai']

# Split Train and Test Set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7, shuffle = False)

#Lasso Regression 
from sklearn.linear_model import Lasso
reg = Lasso(alpha=0.5)
reg.fit(X_train, y_train)

# Results from traditional Lasso
from sklearn.metrics import mean_squared_error
print('Lasso Regression: R^2 score on training set', reg.score(X_train, y_train)*100)
print('Lasso Regression: R^2 score on test set', reg.score(X_test, y_test)*100)

The R^2 score on the test set is -356.

1 Answer


If I understand your question, you are asking about the negative R^2 score.

This is not an error in the strict sense, however: the R^2 score can be arbitrarily negative. It just means that your model does not perform well; in fact, it performs worse than a model that always predicts the average value (such a model would get an R^2 score of 0).
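To make this concrete, here is a minimal sketch on made-up numbers (the arrays are illustrative and not taken from your dataset). It uses R^2 = 1 - SS_res / SS_tot, which is what `reg.score` and `sklearn.metrics.r2_score` compute: predicting the mean of the targets gives exactly 0, and predictions that are far off give a negative value.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the mean of the targets -> SS_res == SS_tot -> R^2 = 0
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Predictions far from the targets -> SS_res > SS_tot -> R^2 < 0
bad_pred = np.array([10.0, 10.0, 10.0, 10.0])
r2_bad = r2_score(y_true, bad_pred)

print("R^2 of mean baseline:", r2_mean)
print("R^2 of bad predictions:", r2_bad)
```

Here SS_tot is 5 and SS_res for the bad predictions is 230, so the second score is 1 - 230/5 = -45. Nothing bounds the score from below.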

Although it produces a bad model, your code is technically correct. Also, the negative R^2 score is not directly connected to how the dataset is split into training and test parts.

How exactly to make a better model is too complex a question to be answered here. Just a few hints (so that you know what topics to look for):

  • your dataset is very small and has relatively many features, so your model very probably overfits (this is also supported by the good training R^2 score) - learn how to diagnose and mitigate over- and underfitting, and read about the bias vs. variance trade-off,
  • this is a time-series problem and should be treated as such - read about the specifics of machine learning prediction for time-series data,
  • you should preprocess your data before fitting any model (this possibly includes, but is not limited to, normalization/standardization, feature encoding, feature generation, dimensionality reduction, adding external data, time-series-specific preprocessing, ...),
  • you should try several different models and grid-search for the best hyperparameters.
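As a starting point, some of the hints above can be combined in one rough sketch: standardization inside a pipeline, a grid search over the Lasso `alpha`, and `TimeSeriesSplit` so that each validation fold comes after its training fold (no shuffling). The data here is synthetic and the `alpha` grid is an arbitrary assumption; substitute your own `X_train` and `y_train`. This is one possible setup, not a recipe that is guaranteed to fix your model.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 120 ordered samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=0.1, size=120)

# The scaler is fitted on each training fold only, so nothing leaks from validation data
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

# Ordered folds: every validation block lies after the data used for training
cv = TimeSeriesSplit(n_splits=5)

# Arbitrary example grid; make_pipeline names the step "lasso"
param_grid = {"lasso__alpha": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="r2")
search.fit(X, y)

print("Best alpha:", search.best_params_["lasso__alpha"])
print("Mean cross-validated R^2:", search.best_score_)
```

Comparing the cross-validated score with the training score gives a first indication of overfitting: a large gap between the two suggests that the model does not generalize.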

Of course, there is much more than this, and if you are new to machine learning, it would be a good idea to read an introductory book or take a course so that you get a basic overview and a starting point for further study. For example, this is an excellent course.

PGlivi