Multivariate time series RNN (LSTM) issues for player stat predictions

Question

I am working on a custom project where I am trying to predict baseball batting and pitching stats for all players in my dataset from 1970 to 2022. For simplicity, and to reduce clutter, I am only going to refer to my batting dataset. After cleanup, the dataset is 26768 rows × 33 columns.

I wanted to push myself to learn something new, so I decided to go with an RNN model.

GOAL of Project: To predict 5 stats for each player starting at their 3rd season through their last season in the league.

Sneak Peek into issue:

ValueError: cannot reshape array of size 36630 into shape (1,33,20)

First, I will provide a bit of background in case it helps in reviewing my issue.

I used Sequential Feature Selection within a ridge regression to obtain my predictors for each stat:

rr = Ridge(alpha=1)

split = TimeSeriesSplit(n_splits=3)

bat_sfs_ba = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1) 
bat_sfs_rbi = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1) 
bat_sfs_hr = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1) 
bat_sfs_bb = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1) 
bat_sfs_so = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)

Scaled Data:

scaler = MinMaxScaler()
batting.loc[:, bat_cols] = scaler.fit_transform(batting[bat_cols])
pitching.loc[:, pitch_cols] = scaler.fit_transform(pitching[pitch_cols])

Fit Data:

bat_sfs_ba.fit(batting[bat_cols], batting['Nxt_BA'])
bat_sfs_rbi.fit(batting[bat_cols], batting['Nxt_RBI'])
bat_sfs_hr.fit(batting[bat_cols], batting['Nxt_HR'])
bat_sfs_bb.fit(batting[bat_cols], batting['Nxt_BB'])
bat_sfs_so.fit(batting[bat_cols], batting['Nxt_SO'])

Obtained list of predictors:

bat_ba_preds = list(bat_cols[bat_sfs_ba.get_support()])
bat_rbi_preds = list(bat_cols[bat_sfs_rbi.get_support()])
bat_hr_preds = list(bat_cols[bat_sfs_hr.get_support()])
bat_bb_preds = list(bat_cols[bat_sfs_bb.get_support()])
bat_so_preds = list(bat_cols[bat_sfs_so.get_support()])

Example of my batting average predictors:

['Age','G','PA','AB','R','H','2B','3B','HR','RBI','CS','BB','SO','OBP','OPS','TB','GDP','SH','SF','IBB']

Imports for model:

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

Building Multivariate time series LSTM model within function:

def bat_ba_mrnn (data, model, predictors, start=2, step=1):
    bat_preds = []
    
    seasons = sorted(data["Year"].unique())
    
    for i in range(start, len(seasons), step):
        current_season = seasons[i]
        train = data[data['Year'] < current_season]
        test = data[data['Year'] == current_season]
        
        model = Sequential()
        
        train = train.values.reshape(1, 33, 20)
        
        model.add(LSTM(units = 175, return_sequences = True, input_shape = (train)))
        model.add(Dropout(0.25))
        model.add(LSTM(units = 142, return_sequences = True, dropout=0.2,recurrent_dropout=0.15))
        model.add(LSTM(units = 125, return_sequences = True, dropout=0.2,recurrent_dropout=0.15))
        model.add(LSTM(units = 100, return_sequences = True, dropout=0.2,recurrent_dropout=0.15))
        model.add(LSTM(units = 75, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 75, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences= False))
        model.add(Dense(units = 1))             
        
        
        model.compile(optimizer = 'adam', loss = 'mean_squared_error')
        model.fit(train[predictors], train['Nxt_BA'])
        
        preds = model.predict(test[predictors]) 
        preds = pd.Series(preds, index=test.index)
        together = pd.concat([test['Nxt_BA'], preds], axis=1)
        together.columns = ['actual', 'prediction']
        
        bat_preds.append(together)
    return pd.concat(bat_preds)

I was initially getting an error about the input being 2-dimensional when 3 dimensions were expected, so I reshaped it as shown above. Now when I run this:

bat_ba_predictions = bat_ba_mrnn(batting, LSTM, bat_ba_preds)

It is giving me this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [160], in <cell line: 1>()
----> 1 bat_ba_predictions = bat_ba_mrnn(batting, LSTM, bat_ba_preds)

Input In [159], in bat_ba_mrnn(data, model, predictors, start, step)
     13 test = data[data['Year'] == current_season]
     15 model = Sequential()
---> 17 train = train.values.reshape(1, 33, 20)
     19 model.add(LSTM(units = 175, return_sequences = True, input_shape = (train)))
     20 model.add(Dropout(0.25))

ValueError: cannot reshape array of size 36630 into shape (1,33,20)

I have tried many different options but have not been able to figure out how to properly reshape this so that it will work.
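
As a quick check of the sizes in the error above, the arithmetic can be sketched in plain Python. The numbers come from the error message and the reshape call. The last step is only an observation: 36630 happens to equal 1110 × 33, which would be consistent with a training slice that still contains all 33 columns rather than the 20 predictors. That reading is an assumption, not something the traceback confirms.

```python
# Sizes taken from the error message and the reshape call in the question.
total_elements = 36630          # size reported by the ValueError
target_shape = (1, 33, 20)      # shape passed to reshape

required = 1
for dim in target_shape:
    required *= dim             # 1 * 33 * 20

# reshape only succeeds when the element counts match exactly.
print(required)                      # 660
print(total_elements == required)    # False

# Even with a free leading dimension, (-1, 33, 20), the size must divide evenly.
batches, remainder = divmod(total_elements, 33 * 20)
print(batches, remainder)            # 55 330

# 36630 does factor cleanly as rows x 33 columns.
rows, leftover = divmod(total_elements, 33)
print(rows, leftover)                # 1110 0
```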

=====

UPDATE:

After looking around some more, I believe that padding the array with zeros may resolve my reshape issue, so after some research I added:

    zeros = np.zeros((2,20))
    zeros[:train.shape[0],:train.shape[1]] = train
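
For comparison, here is a minimal sketch of zero-padding under one specific assumption: each player's seasons form one sequence, so the network input has the shape (players, seasons, features). Keras LSTM layers expect 3-D input laid out as (batch, timesteps, features). The helper `pad_player_sequences` and the toy data are hypothetical and are not part of the project code.

```python
import numpy as np

def pad_player_sequences(sequences, n_timesteps, n_features):
    """Left-pad each (seasons, n_features) array with zero rows up to n_timesteps."""
    out = np.zeros((len(sequences), n_timesteps, n_features), dtype=np.float32)
    for i, seq in enumerate(sequences):
        # Keep only the most recent n_timesteps seasons if a player has more.
        seq = np.asarray(seq, dtype=np.float32)[-n_timesteps:]
        if len(seq):
            out[i, -len(seq):, :] = seq
    return out

# Toy data: three hypothetical players with 2, 4 and 1 seasons, 20 features each.
rng = np.random.default_rng(0)
players = [rng.random((n, 20)) for n in (2, 4, 1)]

X = pad_player_sequences(players, n_timesteps=4, n_features=20)
print(X.shape)  # (3, 4, 20)
```

With data shaped like this, the matching `input_shape` for the first LSTM layer would be `(4, 20)`, that is (timesteps, features) without the batch dimension.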

I have also adjusted the first layer of the LSTM to the line below and removed the reshape line, since after additional research I found that I did not have to convert the data to 3 dimensions and it could stay 2-dimensional, if I understood correctly:

    model.add(LSTM(units = 175, return_sequences = True, input_shape = (33,20)))

While that seems to have helped, as I am no longer receiving a reshape error, I am now receiving the error below:

ValueError: could not convert string to float: 'Alan\xa0Foster'

It seems like it is now including the string columns in this calculation. I tried removing all of the string columns, but then I got this error:

IndexError: tuple index out of range

I am not sure how to get past all of these errors.
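
The string-to-float error can be illustrated with a small pandas sketch, assuming the name column is what reaches the numeric conversion. The toy frame and its values are made up, and `predictors` stands in for the 20-item `bat_ba_preds` list. Selecting the predictor columns before converting to an array keeps the string columns out of the calculation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the batting data; the values are invented.
toy = pd.DataFrame({
    "Name": ["Alan\xa0Foster", "Alan\xa0Foster", "Player B"],
    "Year": [1970, 1971, 1970],
    "Age": [23, 24, 30],
    "HR": [1, 0, 12],
    "Nxt_BA": [0.11, 0.09, 0.27],
})
predictors = ["Age", "HR"]  # stand-in for bat_ba_preds

# Optional cleanup: replace the non-breaking space seen in the error message.
toy["Name"] = toy["Name"].str.replace("\xa0", " ", regex=False)

# Select the numeric predictor columns first, then convert to arrays.
X = toy[predictors].to_numpy(dtype=np.float32)
y = toy["Nxt_BA"].to_numpy(dtype=np.float32)
print(X.shape, y.shape)  # (3, 2) (3,)

# select_dtypes lists every numeric column, which helps when checking a predictor list.
numeric_cols = list(toy.select_dtypes(include="number").columns)
print(numeric_cols)  # ['Year', 'Age', 'HR', 'Nxt_BA']
```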

Why are you reshaping an array of size `36630` into `(1,33,20)`? `1*33*20` gives `660`, i.e. it requires the array to have precisely 660 elements. — adrianop01, Apr 11 '23 at 06:50
@adrianop01 I initially was just adding the 1 to bring in a 3rd dimension. I have tried different options but the 1*33*20 is just what I ended up copying in. My issue is that to obtain the size of 36630 I would need 55.5*33*20 and float causes an issue still. I am not sure how I could reshape this properly to remove the error, or if other changes need to be done in the code to avoid this issue. — DJB17, Apr 11 '23 at 17:53
You want `(-1,33,20)`. https://numpy.org/doc/stable/reference/generated/numpy.reshape.html#numpy.reshape — adrianop01, Apr 12 '23 at 05:41
@adrianop01 I tried that but am receiving this error: "ValueError: cannot reshape array of size 36630 into shape (33,20)". I have been looking further online and am wondering if resizing the array with zeros will resolve it. I don't have experience with this, so I am not sure how to do it. I have tried a few things I have seen, but none of them have worked so far. — DJB17, Apr 12 '23 at 21:40
I added an update to my submission to update on other adjustments I have made, but I am still dealing with issues I am unsure how to resolve. — DJB17, Apr 13 '23 at 01:55
I agree with your conclusion about padding. May I have the link/name of your dataset? — adrianop01, Apr 13 '23 at 12:49
@adrianop01 My dataset is not available online. As part of my project I wanted to learn how to build a web scraper, and I collected the data from the Baseball Reference website. I uploaded my dataset to my GitHub along with a copy of my project files, if that helps. I apologize if it does not, as I am pretty green with GitHub: https://github.com/DJBrito17/DTSC-691-Custom-ML-Project If that does not work or help, is there a way for me to send you the dataset? — DJB17, Apr 13 '23 at 20:13
