I'm trying to perform true out-of-sample forecasting in Python. I've been researching for several days with no luck.

I came across the sample code shown below for stock price forecasting, which I am trying to modify to predict the temperature change caused by a thermochemical process (a time series problem). As I understand it, the sample code shifts the historical dataset (say, 100 data points) by 'n' days, splits the remaining data points into two sets for training (80%) and testing (20%), and then predicts the stock values for the predetermined 'n' days.

Is it possible to modify this code to forecast true out-of-sample values of the dependent variable, i.e. values that lie outside the historical dataset?

Thank you for your help.

from pandas_datareader import data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = data.DataReader('FB', 'yahoo', start= '2015-01-01', end='2020-04-27')

df = df[['Close']]

print(df.tail())

# variable for predicting 'n' days out in the future
forecast = 1

# create another column called prediction that is shifted n days out
df['predicted'] = df[['Close']].shift(-forecast)

# Convert the dataframe to numpy array
X = np.array(df.drop(columns=['predicted']))

# Remove the last n rows
X = X[:-forecast]

# Create the dependent dataset 
y = np.array(df['predicted'])

# Get all the y values except the last n rows
y = y[:-forecast]

# Split data into 80% training and 20% testing
# (note: train_test_split shuffles the rows by default)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create and train the linear regression model
lr = LinearRegression()
lr.fit(x_train, y_train)

# Testing the model using score (returns the coefficient of determination R^2)
lr_score = lr.score(x_test, y_test)

# Set x_forecast to the last n rows of the Close column
x_forecast = np.array(df.drop(columns=['predicted']))[-forecast:]

lr_prediction = lr.predict(x_forecast)

print(lr_score)

print(lr_prediction)
  • Question is a bit unclear. What do you mean by 101th dependent variable? – DerekG Apr 27 '20 at 14:05
  • I just revised the question. I need the code to forecast the dependent variable value for the day after the last day in the historical dataset. – Frank Abraham Apr 27 '20 at 14:20

2 Answers

Basically, this is a plain vanilla machine learning task known as linear regression, in which a function (linear, quadratic, it doesn't really matter) is fit to a dataset. In machine learning tasks, you are trying to predict the label for an example. An example is one data point, the features of the example are the attributes of that data point that you know, and the label is the attribute that you are trying to predict. Out-of-sample forecasting is well explained here, but in machine learning terms you fit your model to one partition of your data, known as the training set (in-sample forecasting). You then test the model's ability to generalize by predicting the labels for the other partition, known as the testing set (out-of-sample forecasting). It is of course important that the model is not trained on the testing set, or your estimate of out-of-sample generalization will be biased and artificially good.

Given these machine learning terms, you should be able to perform straightforward linear regression as described here or on any number of blog posts online.
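
As a rough illustration of the in-sample versus out-of-sample distinction above, here is a minimal sketch on made-up data. The series, the 80/20 cut point, and the helper `r_squared` are all assumptions for the example, and `np.polyfit` is used as a plain least-squares stand-in for scikit-learn's `LinearRegression`. The split is chronological (no shuffling), so the test partition is strictly later in time than the training partition.

```python
import numpy as np

# Hypothetical data: a noisy upward trend standing in for a real time series.
rng = np.random.default_rng(0)
series = 50.0 + 0.5 * np.arange(100) + rng.normal(0.0, 1.0, 100)

# Feature = today's value, label = tomorrow's value.
X = series[:-1]
y = series[1:]

# Chronological split: first 80% for training, last 20% for testing.
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Fit y = slope * x + intercept on the training partition only.
slope, intercept = np.polyfit(X_train, y_train, deg=1)


def r_squared(x, y_true):
    """Coefficient of determination of the fitted line on (x, y_true)."""
    residual = y_true - (slope * x + intercept)
    return 1.0 - np.sum(residual**2) / np.sum((y_true - y_true.mean()) ** 2)


in_sample_r2 = r_squared(X_train, y_train)
out_of_sample_r2 = r_squared(X_test, y_test)
print(f"in-sample R^2:     {in_sample_r2:.3f}")
print(f"out-of-sample R^2: {out_of_sample_r2:.3f}")
```

The out-of-sample score is the one that says something about how the model handles data it has not seen; the in-sample score alone tends to look better than it should.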

DerekG

I'm not quite sure what you are asking. I ran the code you provided, and it simply fits a linear equation that predicts the next value given the previous day's value.

# lr is the linear model that was fitted to the data in the question's code.
inputs = X[0:10]              # Ten input points (named inputs to avoid shadowing the built-in input).
outputs = lr.predict(inputs)  # The ten values predicted from those inputs.

I believe the code you provided is already doing the "next day prediction" that you are looking for.
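
To make that concrete, here is a small sketch on synthetic data (the series and the horizon of 10 are made up, and `np.polyfit` stands in for `LinearRegression`). It mimics what `shift(-forecast)` does in the question's code and shows which time steps the final predictions belong to.

```python
import numpy as np

# Hypothetical series of 100 observations at time steps 0..99.
rng = np.random.default_rng(1)
series = 20.0 + 0.3 * np.arange(100) + rng.normal(0.0, 0.5, 100)

forecast = 10  # horizon: predict the value `forecast` steps ahead

# Pair each value with the value `forecast` steps later,
# which is what shift(-forecast) does in the question's code.
X = series[:-forecast]  # inputs at steps 0..89
y = series[forecast:]   # targets at steps 10..99

# Least-squares line, used here as a stand-in for LinearRegression.
slope, intercept = np.polyfit(X, y, deg=1)

# The last `forecast` observations have no known target yet.
x_forecast = series[-forecast:]  # inputs at steps 90..99
prediction = slope * x_forecast + intercept

# Each prediction belongs to a step beyond the end of the data: 100..109.
future_steps = np.arange(len(series), len(series) + forecast)
for step, value in zip(future_steps, prediction):
    print(f"step {step}: {value:.2f}")
```

So with `forecast = 10`, the inputs are the last 10 known values and the outputs are estimates for the 10 steps after the dataset ends, not for steps going back in time.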

If you are going to be predicting heat-type data, make sure to fit an exponential function, as a linear function is probably going to be less accurate.
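
One possible way to do that, sketched under loud assumptions: the temperature is taken to relax toward a known ambient value `T_env` as `T(t) = T_env + A * exp(-k * t)`, and all numbers below are invented for illustration. Taking logs turns the fit into an ordinary straight-line fit, so plain `np.polyfit` is enough. Your process may well follow a different curve, in which case the model form has to change.

```python
import numpy as np

# Hypothetical cooling curve: T(t) = T_env + A * exp(-k * t), with mild noise.
T_env, A_true, k_true = 25.0, 60.0, 0.05  # assumed values for illustration
t = np.arange(60.0)
rng = np.random.default_rng(2)
noise = 1.0 + rng.normal(0.0, 0.01, t.size)
temps = T_env + A_true * np.exp(-k_true * t) * noise

# Taking logs turns the exponential into a line: log(T - T_env) = log(A) - k * t.
# This assumes T_env is known and every reading stays above it.
neg_k, log_A = np.polyfit(t, np.log(temps - T_env), deg=1)
k_fit, A_fit = -neg_k, np.exp(log_A)

# Forecast 10 steps past the last observation.
t_future = np.arange(t[-1] + 1, t[-1] + 11)
forecast_temps = T_env + A_fit * np.exp(-k_fit * t_future)
print(f"fitted k = {k_fit:.4f}, fitted A = {A_fit:.2f}")
print(forecast_temps.round(2))
```

Because the fitted function is defined for any `t`, forecasting beyond the historical data is just a matter of evaluating it at future time values.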

Bobby Ocean
  • Thanks for the tip regarding the exponential function. If you increase the value of 'forecast' to say 10, the code will predict the value going back 10 days in time which is pointless. I need it to forecast the value 10 days (or n days) in the future. – Frank Abraham Apr 27 '20 at 14:34
  • What? That is not true; just print df['Close'][:10] and df['predicted'][:10]. You can clearly see that input zero (for example) is 78.449997 and the output is 77.190002, which is the "NEXT" day's value, not the previous day's. – Bobby Ocean Apr 27 '20 at 14:45
  • OK, so how do I get the code to predict the stock value for, say, May 27, 2020 in this case? – Frank Abraham Apr 27 '20 at 14:54
  • Note that the value X[-1] is April 24th. You can see this in print(df['Close'][-5:]). Let's run a few values through lr: print(X[-5:]); print(lr.predict(X[-5:])). You can see the inputs X and the predicted values for those inputs. The last predicted value would be for April 27th, 2020. – Bobby Ocean Apr 27 '20 at 15:04