
I am trying to figure out how to structure my dataset and build X and y so that they work with Keras' stacked LSTM for sequence classification.

I have panel data and want to predict a class label. I am not sure how to interpret timesteps, or how to shape the data properly given its panel structure.

# Libraries
from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np
import pandas as pd

# Here is an example of my data
df = pd.read_csv('https://raw.githubusercontent.com/rocketfish88/democ/master/sample2.csv')
df
# Contains a handful of features, a target, year, and id of the observation
   id  year  x1  x2  x3  y
0   A  2015   1   1   1  1
1   A  2016   2   2   2  1
2   A  2017   3   3   3  2
3   A  2018   4   4   4  2
4   B  2015   1   1   1  3
5   B  2016   2   2   2  2
6   B  2017   3   3   3  1
7   B  2018   4   4   4  1
8   C  2015   1   1   1  2
9   C  2016   2   2   2  2
10  C  2017   3   3   3  3
11  C  2018   4   4   4  2

Keras.io presents the following example:

data_dim = 16
timesteps = 8
num_classes = 10

# expected input data shape: (batch_size, timesteps, data_dim)
model = Sequential()
model.add(LSTM(32, return_sequences=True,
               input_shape=(timesteps, data_dim)))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32, return_sequences=True))  # returns a sequence of vectors of dimension 32
model.add(LSTM(32))  # return a single vector of dimension 32
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Generate dummy training data
x_train = np.random.random((1000, timesteps, data_dim))
y_train = np.random.random((1000, num_classes))

# Generate dummy validation data
x_val = np.random.random((100, timesteps, data_dim))
y_val = np.random.random((100, num_classes))

model.fit(x_train, y_train,
          batch_size=64, epochs=5,
          validation_data=(x_val, y_val))

I am fairly lost as to how to transform my dataset into the required shape of (batch_size, timesteps, data_dim).
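For reference, one common reading of that shape for panel data is: each unit (`id`) is one sequence, each `year` is one timestep, and `x1`..`x3` are the features. The sketch below rebuilds the sample table inline and reshapes it on that basis. It assumes a balanced panel (every unit has the same years) and, purely for illustration, takes each unit's last-year `y` as the single label for its sequence; the names `feature_cols`, `n_units` and `y_last` are introduced here and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Rebuild the sample panel from the question inline (no download needed).
df = pd.DataFrame({
    "id":   ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "year": [2015, 2016, 2017, 2018] * 3,
    "x1":   [1, 2, 3, 4] * 3,
    "x2":   [1, 2, 3, 4] * 3,
    "x3":   [1, 2, 3, 4] * 3,
    "y":    [1, 1, 2, 2, 3, 2, 1, 1, 2, 2, 3, 2],
})

feature_cols = ["x1", "x2", "x3"]

# Rows must be ordered by unit, then by time, before reshaping.
df = df.sort_values(["id", "year"])

n_units = df["id"].nunique()      # number of sequences (samples)
timesteps = df["year"].nunique()  # assumes a balanced panel
data_dim = len(feature_cols)      # features per timestep

# (n_units * timesteps, data_dim) -> (n_units, timesteps, data_dim)
X = (df[feature_cols]
     .to_numpy(dtype="float32")
     .reshape(n_units, timesteps, data_dim))

# Assumption: one label per sequence, taken from each unit's last year.
y_last = df.groupby("id")["y"].last().to_numpy()

# One-hot encode; labels run 1..3 in the sample, so shift to 0..2.
num_classes = 3
y = np.eye(num_classes, dtype="float32")[y_last - 1]

print(X.shape, y.shape)
```

With arrays like these, `input_shape=(timesteps, data_dim)` in the Keras example would be `(4, 3)`. If a prediction is wanted for every year rather than one per unit, the target would instead need a shape of `(n_units, timesteps, num_classes)` and the last LSTM layer would need `return_sequences=True`.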

I appreciate any help!

John Stud
  • LSTM is good for modelling sequences, it's unclear how your data is a sequential classification task? You can just use a feed-forward model with the features as input and y as output. – nuric Feb 02 '19 at 20:20
  • My data is time series cross sectional, or panel, which means we have a sequence of repeated inputs on the groups of units over time. Feed-forward examples do not seem to have a time component or repeated observations on the same entity. – John Stud Feb 02 '19 at 20:30
  • If it is a fixed number of observations, you can flatten it such that it becomes `(batch_size, timesteps*features)` to use a feed-forward which is a good thing to try. Otherwise, it is not clear as to what your timesteps in the data correspond to from the question. – nuric Feb 02 '19 at 21:53
  • Interesting! Could you please offer some additional advice on the `batch_size` and how to translate my current dataset to fit the LSTM expectations with `timesteps*features`? From reading about this, it seems `batch_size` means the number of sequences trained together. So I am not sure whether I train all the year `2014` entries together or all of unit `A`'s entries together. Then, how would I go about computing `timesteps*features`? It doesn't make sense to multiply the timestep by x1, x2, etc. – John Stud Feb 03 '19 at 00:25
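
To illustrate the flattening suggested in the comments: `timesteps*features` is a count of columns, not a product of the feature values. Each unit's block of rows is laid end to end into a single row, so a panel with 4 years and 3 features gives 12 inputs per unit. The sketch below uses a stand-in array shaped like the sample data (3 units, 4 years, 3 features); the variable names are illustrative only.

```python
import numpy as np

n_units, timesteps, data_dim = 3, 4, 3

# Stand-in for a panel already arranged as (n_units, timesteps, data_dim):
# every feature equals 1, 2, 3, 4 across the four years, as in the sample.
years = np.arange(1, timesteps + 1, dtype="float32")
X = np.tile(years[None, :, None], (n_units, 1, data_dim))

# Flatten each unit's (timesteps, data_dim) block into one row of
# timesteps * data_dim values, suitable for a feed-forward (Dense) model.
X_flat = X.reshape(n_units, timesteps * data_dim)

print(X_flat.shape)
print(X_flat[0])
```

In both layouts a sample is one unit (`A`, `B`, `C`), not one year, so `batch_size` would be the number of units processed per gradient update.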
