I have a large dataset of 23k rows. The data looks something like this:
import pandas as pd
d = {'Date': ['1-1-2020', '1-1-2020', '1-2-2020', '1-2-2020'],
     'Stock_id': [5, 41, 5, 41],
     'last_price': [230, 8, 241, 9],
     'price': [241, 9, 240, 8.5]}
df = pd.DataFrame(data=d)
Date Stock_id last_price price
0 1-1-2020 5 230 241.0
1 1-1-2020 41 8 9.0
2 1-2-2020 5 241 240.0
3 1-2-2020 41 9 8.5
Note that the data includes many stocks on many different dates. How can I create a model that uses features such as last_price and Stock_id to predict the next day's price? And how can I have it re-train on the older data as new dates come in?
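(For the next-day target, I assume I would shift price forward within each stock, something like the snippet below; next_price is just a name I made up to illustrate.)
# Assumed target construction: tomorrow's price for each stock.
df = df.sort_values(['Stock_id', 'Date'])
df['next_price'] = df.groupby('Stock_id')['price'].shift(-1)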
Now, this is the best I could come up with. I used LinearRegression, but I am open to advice on any other model.
from sklearn import metrics  # imported for evaluating the model later
from sklearn import linear_model
from sklearn.model_selection import train_test_split

X = df[['Stock_id', 'last_price']]
y = df['price']  # a Series, so predict() returns a 1-D array

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)

result = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
Index Actual Predicted
487 45 32
4154 420 512
Is there a way to train the model on the first 3000 rows, have it predict, say, 12-11-2020, and then add the 12-11-2020 data to the training set before predicting 12-12-2020, and so on? I was hoping to get output like this:
Date Actual Predicted
12-11-2020 45 32
12-11-2020 420 512
12-12-2020 43 34
12-12-2020 423 513
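To make it concrete, here is a rough walk-forward sketch of what I have in mind: re-fit on everything seen so far, predict the next date, then fold that date's rows into the training data. The 3000-row cutoff and the expanding window are assumptions on my part, not something I know to be the right setup.
import pandas as pd
from sklearn import linear_model

features = ['Stock_id', 'last_price']

# Sort chronologically so "first 3000 rows" means the earliest data.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)

train = df.iloc[:3000].copy()  # initial training window (assumed cutoff)
test = df.iloc[3000:]
frames = []

# Walk forward one date at a time over everything after the cutoff.
for date in test['Date'].unique():
    # Re-train on all rows observed so far (expanding window).
    lm = linear_model.LinearRegression()
    lm.fit(train[features], train['price'])

    # Predict every stock traded on this date.
    day = test[test['Date'] == date]
    frames.append(pd.DataFrame({
        'Date': day['Date'].values,
        'Actual': day['price'].values,
        'Predicted': lm.predict(day[features]),
    }))

    # Fold this date's rows into the training data before the next step.
    train = pd.concat([train, day])

result = pd.concat(frames, ignore_index=True)
print(result)
Is this expanding-window retraining a reasonable way to do it, or is there something built-in (e.g. sklearn's TimeSeriesSplit) that fits better?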