I'm trying to perform true out-of-sample forecasting in Python. I've been researching for several days with no luck.
I came across the sample code below for stock price forecasting, which I'm trying to modify to predict the temperature change caused by a thermochemical process (a time series problem). As I understand it, the code shifts the historical dataset (say, 100 datapoints) by 'n' days, splits the remaining datapoints into a training set (80%) and a testing set (20%), and then predicts/estimates the stock values for those predetermined 'n' days.
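To check my understanding of the shift step, here is a toy illustration I put together (values are made up, just to see what `shift(-forecast)` does):

```python
import pandas as pd

# Toy series of 5 "closing" values
df = pd.DataFrame({'Close': [10.0, 11.0, 12.0, 13.0, 14.0]})
forecast = 2

# shift(-forecast) pairs each row with the value 'forecast' steps ahead
df['predicted'] = df['Close'].shift(-forecast)

# The last 'forecast' rows have no future value, so they become NaN;
# those are the rows the sample code later uses as x_forecast.
print(df)
#    Close  predicted
# 0   10.0       12.0
# 1   11.0       13.0
# 2   12.0       14.0
# 3   13.0        NaN
# 4   14.0        NaN
```

So the "prediction" in the sample code is still for timestamps that exist inside the historical dataset, which is why I'm asking about truly stepping beyond it.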
Is it possible to modify this code to forecast true out-of-sample dependent variables which are outside the historical dataset?
Thank you for your help.
from pandas_datareader import data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
df = data.DataReader('FB', 'yahoo', start='2015-01-01', end='2020-04-27')
df = df[['Close']]
print(df.tail())
# number of days 'n' to predict into the future
forecast = 1
# create a 'predicted' column holding the Close value n days in the future
df['predicted'] = df[['Close']].shift(-forecast)
# Convert the dataframe to a numpy array, dropping the target column
X = np.array(df.drop(['predicted'], axis=1))
# Remove the last n rows
X = X[:-forecast]
# Create the dependent dataset
y = np.array(df['predicted'])
# Get all the y values except the last n rows
y = y[:-forecast]
# Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create and train the linear regression model
lr = LinearRegression()
lr.fit(x_train, y_train)
# Testing the model using score (returns the coefficient of determination R^2)
lr_score = lr.score(x_test, y_test)
# x_forecast: the last n rows of the Close column (the rows with no known target)
x_forecast = np.array(df.drop(['predicted'], axis=1))[-forecast:]
lr_prediction = lr.predict(x_forecast)
print(lr_score)
print(lr_prediction)
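For reference, the approach I've been experimenting with for stepping beyond the dataset is recursive (iterated) forecasting: predict one step ahead, then feed that prediction back in as the input for the next step. I'm not sure this is the right way to do it, which is part of my question. Here is a minimal self-contained sketch on a toy linear series (the data and step count are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: each input is the series value, the target is
# the value one step ahead (same framing as the shift-based code above)
series = np.arange(20, dtype=float)   # stand-in for the Close column
X = series[:-1].reshape(-1, 1)
y = series[1:]

lr = LinearRegression().fit(X, y)

# Recursive forecasting: feed each prediction back in as the input
# for the next step, walking past the end of the historical data
last_value = series[-1]
future = []
for _ in range(5):                    # 5 steps beyond the dataset
    last_value = lr.predict([[last_value]])[0]
    future.append(last_value)
print(future)  # approximately [20.0, 21.0, 22.0, 23.0, 24.0]
```

On this toy linear series the model learns the exact one-step rule, but on real data I'd expect the errors to compound with each iterated step. Is this the standard way to get true out-of-sample forecasts, or is there a better pattern?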