I am working on a custom project in which I am trying to predict baseball batting and pitching stats for all players in my dataset, which covers 1970 - 2022. For simplicity, and to reduce clutter, I am only going to refer to my batting dataset. After cleanup, my dataset is 26768 rows × 33 columns.
I wanted to push myself to learn something new, so I decided to go with an RNN model.
GOAL of Project: To predict 5 stats for each player, starting at their 3rd season and continuing through their last season in the league.
Sneak Peek into issue:
ValueError: cannot reshape array of size 36630 into shape (1,33,20)
First, I will provide a bit of background in case it helps in reviewing my issue.
I used sequential feature selection with a ridge regression estimator to obtain my predictors for each stat:
rr = Ridge(alpha=1)
split = TimeSeriesSplit(n_splits=3)
bat_sfs_ba = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
bat_sfs_rbi = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
bat_sfs_hr = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
bat_sfs_bb = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
bat_sfs_so = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
Scaled Data:
scaler = MinMaxScaler()
batting.loc[:, bat_cols] = scaler.fit_transform(batting[bat_cols])
pitching.loc[:, pitch_cols] = scaler.fit_transform(pitching[pitch_cols])
Fit Data:
bat_sfs_ba.fit(batting[bat_cols], batting['Nxt_BA'])
bat_sfs_rbi.fit(batting[bat_cols], batting['Nxt_RBI'])
bat_sfs_hr.fit(batting[bat_cols], batting['Nxt_HR'])
bat_sfs_bb.fit(batting[bat_cols], batting['Nxt_BB'])
bat_sfs_so.fit(batting[bat_cols], batting['Nxt_SO'])
Obtained list of predictors:
bat_ba_preds = list(bat_cols[bat_sfs_ba.get_support()])
bat_rbi_preds = list(bat_cols[bat_sfs_rbi.get_support()])
bat_hr_preds = list(bat_cols[bat_sfs_hr.get_support()])
bat_bb_preds = list(bat_cols[bat_sfs_bb.get_support()])
bat_so_preds = list(bat_cols[bat_sfs_so.get_support()])
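For readers unfamiliar with this step, here is a minimal sketch of how a boolean mask like the one returned by `get_support()` picks column names out of an index. The names `demo_cols`, `demo_mask`, and `demo_preds` are hypothetical stand-ins, and the sketch assumes `bat_cols` is a pandas `Index` (or NumPy array) rather than a plain list, since a plain list cannot be indexed with a boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for bat_cols; assumed to be a pandas Index.
demo_cols = pd.Index(["Age", "G", "PA", "AB", "R"])

# Hypothetical mask shaped like the output of get_support():
# one boolean per column, True where the feature was selected.
demo_mask = np.array([True, False, True, True, False])

# Boolean indexing keeps only the columns flagged True, in original order.
demo_preds = list(demo_cols[demo_mask])
print(demo_preds)  # ['Age', 'PA', 'AB']
```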
Example of my batting average predictors:
['Age','G','PA','AB','R','H','2B','3B','HR','RBI','CS','BB','SO','OBP','OPS','TB','GDP','SH','SF','IBB']
Imports for model:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
Building Multivariate time series LSTM model within function:
def bat_ba_mrnn(data, model, predictors, start=2, step=1):
    bat_preds = []

    seasons = sorted(data["Year"].unique())

    for i in range(start, len(seasons), step):
        current_season = seasons[i]

        train = data[data['Year'] < current_season]
        test = data[data['Year'] == current_season]

        model = Sequential()

        train = train.values.reshape(1, 33, 20)

        model.add(LSTM(units = 175, return_sequences = True, input_shape = (train)))
        model.add(Dropout(0.25))
        model.add(LSTM(units = 142, return_sequences = True, dropout=0.2,recurrent_dropout=0.15))
        model.add(LSTM(units = 125, return_sequences = True, dropout=0.2,recurrent_dropout=0.15))
        model.add(LSTM(units = 100, return_sequences = True, dropout=0.2,recurrent_dropout=0.15))
        model.add(LSTM(units = 75, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 75, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences= False))
        model.add(Dense(units = 1))

        model.compile(optimizer = 'adam', loss = 'mean_squared_error')

        model.fit(train[predictors], train['Nxt_BA'])

        preds = model.predict(test[predictors])
        preds = pd.Series(preds, index=test.index)

        together = pd.concat([test['Nxt_BA'], preds], axis=1)
        together.columns = ['actual', 'prediction']
        bat_preds.append(together)

    return pd.concat(bat_preds)
I was initially getting an error about the input being 2-dimensional when 3 dimensions were expected, so I added the reshape shown above. Now, when I run this:
bat_ba_predictions = bat_ba_mrnn(batting, LSTM, bat_ba_preds)
It is giving me this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [160], in <cell line: 1>()
----> 1 bat_ba_predictions = bat_ba_mrnn(batting, LSTM, bat_ba_preds)
Input In [159], in bat_ba_mrnn(data, model, predictors, start, step)
13 test = data[data['Year'] == current_season]
15 model = Sequential()
---> 17 train = train.values.reshape(1, 33, 20)
19 model.add(LSTM(units = 175, return_sequences = True, input_shape = (train)))
20 model.add(Dropout(0.25))
ValueError: cannot reshape array of size 36630 into shape (1,33,20)
I have tried many different options but have not been able to figure out how to properly reshape this so that it will work.
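The numbers in the error message can be checked by hand. The sketch below only does the arithmetic: 36630 values spread over the 33 columns of the full frame is 1110 rows, whereas the target shape (1, 33, 20) holds just 660 values. Note that the reshape is applied to every column of `train`, not only to the 20 predictors. The variable names are illustrative, and the zero-filled array is a stand-in for the real data.

```python
import numpy as np

n_values = 36630           # size reported in the ValueError
n_columns = 33             # columns in the cleaned batting frame
n_rows = n_values // n_columns
print(n_rows, n_values % n_columns)    # 1110 0

target_shape = (1, 33, 20)
target_size = int(np.prod(target_shape))
print(target_size)                     # 660

# reshape only succeeds when the element counts match exactly.
demo = np.zeros((n_rows, n_columns))   # stand-in for train.values
try:
    demo.reshape(target_shape)
    reshape_ok = True
except ValueError:
    reshape_ok = False
print(reshape_ok)                      # False
```

Since 36630 is not equal to 660, no reshape to (1, 33, 20) can succeed on this array, whatever the order of the dimensions.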
=====
UPDATE:
After looking around some more, I believe that padding the array with zeros may resolve my reshape issue, so after some research I added:
zeros = np.zeros((2,20))
zeros[:train.shape[0],:train.shape[1]] = train
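For comparison, padding is usually applied per sequence, so that every sample ends up with the same number of timesteps. The following is a minimal NumPy sketch under that assumption. The helper `pad_sequence`, the choice of left-padding, and the toy shapes are all hypothetical and not taken from my actual code. A (2, 20) zero array like the one above could hold at most 2 rows of 20 features, so it could not hold the full training frame.

```python
import numpy as np

def pad_sequence(seq, max_len):
    """Left-pad a (timesteps, features) array with zero rows up to max_len."""
    seq = np.asarray(seq, dtype=float)
    n_steps, n_features = seq.shape
    if n_steps >= max_len:
        return seq[-max_len:]            # keep only the most recent steps
    padded = np.zeros((max_len, n_features))
    padded[max_len - n_steps:] = seq     # real data goes at the end
    return padded

# Hypothetical player with 2 seasons and 20 features per season.
short_career = np.ones((2, 20))
padded = pad_sequence(short_career, max_len=5)
print(padded.shape)       # (5, 20)
print(padded[:3].sum())   # 0.0
```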
I have also adjusted the first layer of the LSTM to the line below and removed the reshape line, since after additional research I found that I did not have to convert the data to 3 dimensions and could keep it as 2 dimensions, if I understood correctly:
model.add(LSTM(units = 175, return_sequences = True, input_shape = (33,20)))
With that change I am no longer receiving a reshape error, but I am now receiving the error below:
ValueError: could not convert string to float: 'Alan\xa0Foster'
It seems that the string columns are now being included in the calculation. I tried removing all of the string columns, but then I got this error:
IndexError: tuple index out of range
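The string-to-float error above can be reproduced on a toy frame. The sketch below assumes a hypothetical `Name` column and a made-up second player. It shows that converting the whole frame to floats fails, while selecting only the numeric predictor columns first works. It does not attempt to explain the IndexError, which would need its full traceback.

```python
import pandas as pd

# Toy frame; the 'Name' column and the second player are hypothetical.
toy = pd.DataFrame({
    "Name": ["Alan\xa0Foster", "Bob\xa0Smith"],
    "Year": [1970, 1970],
    "H": [0.1, 0.4],
    "HR": [0.0, 0.3],
})
predictors = ["H", "HR"]

# Selecting the predictor columns first gives a clean float array.
numeric = toy[predictors].to_numpy(dtype=float)
print(numeric.shape)  # (2, 2)

# Converting every column, names included, is expected to fail.
try:
    toy.to_numpy(dtype=float)
    converted = True
except (ValueError, TypeError):
    converted = False
print(converted)  # False

# select_dtypes lists the columns that are safe to convert.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Year', 'H', 'HR']
```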
I am not sure how to get past all of these errors.
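On the shape question: in Keras, `input_shape` for an LSTM layer describes one sample as (timesteps, features) and leaves out the batch dimension, but the array passed to `fit` still has to be 3-dimensional: (samples, timesteps, features). Read that way, `input_shape = (33,20)` means 33 timesteps of 20 features each, which is probably not what was intended for 33 columns. Below is a minimal sketch of one possible way to build such an array from per-player windows of consecutive seasons. The `player_id` column, the window length, and the toy values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; 'player_id' is an assumed column name.
toy = pd.DataFrame({
    "player_id": ["a", "a", "a", "b", "b"],
    "Year":      [1970, 1971, 1972, 1970, 1971],
    "H":         [0.1, 0.2, 0.3, 0.4, 0.5],
    "HR":        [0.0, 0.1, 0.2, 0.3, 0.4],
    "Nxt_BA":    [0.25, 0.26, 0.27, 0.30, 0.31],
})
features = ["H", "HR"]
timesteps = 2   # assumed window length: seasons per sample

X, y = [], []
for _, career in toy.sort_values("Year").groupby("player_id"):
    values = career[features].to_numpy(dtype=float)
    targets = career["Nxt_BA"].to_numpy(dtype=float)
    # Slide a window of `timesteps` consecutive seasons over each career;
    # the target is the Nxt_BA of the window's last season.
    for end in range(timesteps, len(career) + 1):
        X.append(values[end - timesteps:end])
        y.append(targets[end - 1])

X = np.stack(X)   # (samples, timesteps, features)
y = np.array(y)
print(X.shape, y.shape)  # (3, 2, 2) (3,)
```

With arrays shaped like this, the first layer would take `input_shape=(timesteps, len(features))`, here (2, 2).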