
I am trying to implement an LSTM with Keras.

I know that LSTMs in Keras require a 3D tensor with shape (nb_samples, timesteps, input_dim) as input. However, I am not entirely sure what the input should look like in my case, as I have just one sample of T observations for each input, not multiple samples, i.e. (nb_samples=1, timesteps=T, input_dim=N). Is it better to split each of my inputs into samples of length T/M? T is around a few million observations for me, so how long should each sample be in that case, i.e., how would I choose M?
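
For concreteness, below is a sketch of the kind of splitting I mean (the sizes are made up):

import numpy as np

T, N = 1000000, 3              # made-up sizes for illustration
series = np.random.rand(T, N)  # my single long sample of shape (T, N)

M = 1000                       # hypothetical sample length
nb_samples = T // M
# Drop the remainder and reshape into (nb_samples, M, N)
X = series[:nb_samples * M].reshape(nb_samples, M, N)
print(X.shape)                 # (1000, 1000, 3)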

Also, am I right in thinking that this tensor should look something like:

[[[a_11, a_12, ..., a_1M], [a_21, a_22, ..., a_2M], ..., [a_N1, a_N2, ..., a_NM]], 
 [[b_11, b_12, ..., b_1M], [b_21, b_22, ..., b_2M], ..., [b_N1, b_N2, ..., b_NM]], 
 ..., 
 [[x_11, x_12, ..., x_1M], [x_21, x_22, ..., x_2M], ..., [x_N1, x_N2, ..., x_NM]]]

where M and N defined as before and x corresponds to the last sample that I would have obtained from splitting as discussed above?

Finally, given a pandas dataframe with T observations in each column, and N columns, one for each input, how can I create such an input to feed to Keras?

  • Could you add an example dataset to your question, please? Because it's not clear which sequence of inputs will create what kind of target output in your model. – mertyildiran Oct 04 '16 at 16:35
  • Can you explain what the format or data type is for one observation? Is it a single numerical value, a set of values, or something else? – Andrew Oct 07 '16 at 13:48

2 Answers


Below is an example that sets up time series data to train an LSTM. The model output is nonsense as I only set it up to demonstrate how to build the model.

import pandas as pd
import numpy as np
# Get some time series data
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
df.head()

Time series dataframe:

          Date      A       B       C      D      E      F      G
0   2008-03-18  24.68  164.93  114.73  26.27  19.21  28.87  63.44
1   2008-03-19  24.18  164.89  114.75  26.22  19.07  27.76  59.98
2   2008-03-20  23.99  164.63  115.04  25.78  19.01  27.04  59.61
3   2008-03-25  24.14  163.92  114.85  27.41  19.61  27.84  59.41
4   2008-03-26  24.44  163.45  114.84  26.86  19.53  28.02  60.09

You can put your inputs into a single vector and then use the pandas .cumsum() function to build the sequence for the time series:

# Define which columns are inputs and which are outputs
# (A through F as inputs and G as output is inferred from the shapes shown below)
input_cols = ['A', 'B', 'C', 'D', 'E', 'F']
output_cols = ['G']

# Put your inputs into a single list
df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
# Double-encapsulate the list so that you can sum it in the next step and keep time steps as separate elements
df['single_input_vector'] = df.single_input_vector.apply(lambda x: [list(x)])
# Use .cumsum() to include previous row vectors in the current row list of vectors
df['cumulative_input_vectors'] = df.single_input_vector.cumsum()
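
For example, with the dataframe above, the first two entries of cumulative_input_vectors would look like this (a sketch, assuming the input_cols defined above):

df.cumulative_input_vectors[0]
# [[24.68, 164.93, 114.73, 26.27, 19.21, 28.87]]
df.cumulative_input_vectors[1]
# [[24.68, 164.93, 114.73, 26.27, 19.21, 28.87],
#  [24.18, 164.89, 114.75, 26.22, 19.07, 27.76]]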

The output can be set up in a similar way, but it will be a single vector instead of a sequence:

# If your output is multi-dimensional, you need to capture those dimensions in one object
# If your output is a single dimension, this step may be unnecessary
df['output_vector'] = df[output_cols].apply(tuple, axis=1).apply(list)

The input sequences have to be the same length to run them through the model, so you need to pad them to be the max length of your cumulative vectors:

# Pad your sequences so they are the same length
from keras.preprocessing.sequence import pad_sequences

max_sequence_length = df.cumulative_input_vectors.apply(len).max()
# Save it as a list   
padded_sequences = pad_sequences(df.cumulative_input_vectors.tolist(), max_sequence_length).tolist()
df['padded_input_vectors'] = pd.Series(padded_sequences).apply(np.asarray)

Training data can be pulled from the dataframe and put into numpy arrays. Note that the input data that comes out of the dataframe will not make a 3D array. It makes an array of arrays, which is not the same thing.
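
To see the difference, here is a minimal sketch (np.stack is just one way to get a true 3D array; the answer's code below uses hstack plus reshape):

example = pd.Series([np.zeros((3, 2)), np.zeros((3, 2))])
arr = np.asarray(example)
print(arr.shape)            # (2,) -- a 1D object array of arrays, not 3D
print(np.stack(arr).shape)  # (2, 3, 2) -- an actual 3D array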

You can use hstack and reshape to build a 3D input array.

# Extract your training data
X_train_init = np.asarray(df.padded_input_vectors)
# Use hstack and reshape to make the inputs a 3D array
X_train = np.hstack(X_train_init).reshape(len(df),max_sequence_length,len(input_cols))
y_train = np.hstack(np.asarray(df.output_vector)).reshape(len(df),len(output_cols))

To prove it:

>>> print(X_train_init.shape)
(11,)
>>> print(X_train.shape)
(11, 11, 6)
>>> print(X_train == X_train_init)
False

Once you have the training data you can define the dimensions of your input and output layers.

# Get your input dimensions
# Input length is the length for one input sequence (i.e. the number of rows for your sample)
# Input dim is the number of dimensions in one input vector (i.e. number of input columns)
input_length = X_train.shape[1]
input_dim = X_train.shape[2]
# Output dimensions is the shape of a single output vector
# In this case it's just 1, but it could be more
output_dim = len(y_train[0])

Build the model:

from keras.models import Model, Sequential
from keras.layers import LSTM, Dense

# Build the model
model = Sequential()

# I arbitrarily picked the output dimensions as 4
# (note: in Keras 2.x this would be written LSTM(4, input_shape=(input_length, input_dim)))
model.add(LSTM(4, input_dim = input_dim, input_length = input_length))
# The max output value is > 1 so relu is used as final activation.
model.add(Dense(output_dim, activation='relu'))

model.compile(loss='mean_squared_error',
              optimizer='sgd',
              metrics=['accuracy'])

Finally you can train the model and save the training log as history:

# Set batch_size to 7 to show that it doesn't have to be a factor or multiple of your sample size
# (note: the nb_epoch argument below was renamed to epochs in Keras 2.x)
history = model.fit(X_train, y_train,
              batch_size=7, nb_epoch=3,
              verbose = 1)

Output:

Epoch 1/3
11/11 [==============================] - 0s - loss: 3498.5756 - acc: 0.0000e+00     
Epoch 2/3
11/11 [==============================] - 0s - loss: 3498.5755 - acc: 0.0000e+00     
Epoch 3/3
11/11 [==============================] - 0s - loss: 3498.5757 - acc: 0.0000e+00 

That's it. Use model.predict(X), where X has the same format as X_train (apart from the number of samples), to make predictions from the model.
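
For example, predicting on the training set itself is a quick sanity check of the shapes:

preds = model.predict(X_train)
print(preds.shape)  # (11, 1) -- one output vector per sample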

Andrew
  • This is great, exactly what I needed! Thanks very much! One thing that isn't entirely clear to me is what you mean by the output dimension. On the one hand you say that "In this case it's just 1, but it could be more" while on the other hand you say that "I arbitrarily picked the output dimensions as 4". Is the output dimension just the number of columns in y (i.e. not the number of observations, but the number of *variables* that you're trying to predict at the same time)? Why could you choose 4 here then while it's actually 1? – dreamer Oct 13 '16 at 19:28
  • Wish I could've given you the bounty before it expired btw, you would have deserved it. Really appreciate your answer a lot! Struggled immensely with this. – dreamer Oct 13 '16 at 19:29
  • I'm talking about 2 different outputs: the hidden layer output and the final output. The network I built actually has 2 layers (not counting the input vectors as a layer). There is the LSTM layer and a Dense layer. The LSTM is your hidden layer. The vectors that are passed out of the LSTM layer have 4 dimensions, but you can theoretically choose any number, as the subsequent Dense layer will accept a vector of that shape as its input. The final output (i.e. your y's) is a single number in this case but could be a vector of n dimensions, which is why I said it could be more. – Andrew Oct 13 '16 at 19:46
  • Ah okay, that makes sense. Thank you again for everything, really highly appreciated :)! – dreamer Oct 13 '16 at 20:58
  • When I predict with a testset that contains less observations than the trainset I get: `expected lstm_input_1 to have shape (None, 405, 13) but got array with shape (102, 102, 13)` [405 is the length of my train set, 13 is my number of X inputs, 102 is the length of my testset]. I generated the test X in the same way as you did for the train X. Do you know what I'm doing wrong? – dreamer Oct 15 '16 at 21:01
  • I suspect that the second argument of the test triplet should be 405, but I'm just not sure if that makes sense given that its number of oberservations is 102, and also I don't know how I can shape it to that format. – dreamer Oct 15 '16 at 21:19
  • Make sure you set up your dataframe before you split into testing and training data. Splitting should be your last step before you build the model. Also, Keras has a built-in validation. If you add `validation_split = 0.25` to your `fit()` call then Keras will automatically use 25% of your records as validation data. If you do that you don't have to split your data at all. – Andrew Oct 18 '16 at 12:54
  • awesome answer! just one question, I am trying with different data and all is well but when you do `# Save it as a list padded_sequences = pad_sequences(df.cumulative_input_vectors.tolist(), max_sequence_length).tolist() df['padded_input_vectors'] = pd.Series(padded_sequences).apply(np.asarray)` my padded sequence is all zeros while the cumulative has the right values? I tried with different sizes of max_sequence_length – lorenzori Apr 10 '17 at 12:44
  • This might be a stupid question but are you sure that the output from the padded sequence is *all* zeroes? By default `pad_sequences` will put the actual values at the end of the array and prepend the zeroes to the actual data. – Andrew Apr 10 '17 at 13:46
  • let me double check if it is actually all zeros, but qualitatively they were! – lorenzori Apr 10 '17 at 14:49
  • so yes it returns 0s for almost all of them, I am struggling to understand what is doing. Below an example of `df.loc[0,['cumulative_input_vectors', 'padded_input_vectors']]`: `cumulative_input_vectors [[0.00154294032023, 0.992925949858, 1.0, 0.408... padded_input_vectors [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]] Name: 0, dtype: object` – lorenzori Apr 11 '17 at 07:30
  • I'm not sure I'm getting the input data right... I guess A, B, C etc. are different time series, right? So in this case shouldn't your output be a prediction for each one of them? Why did you set the last column (i.e. another time series) as the output? – Mewtwo Aug 01 '17 at 09:41
  • @Andrew, thanks, you made me understand how it should be. But how do I do this with a large dataframe of 1600 row and 2400 columns? when I run the `pad_sequences`, specifically the `df.cumulative_input_vectors.tolist()` It gives Memory error. Also when only running the tolist() it took nearly 1 hour running and consumed up to 12gb or ram that I had to cancel. What do? – mrbTT Jan 24 '19 at 16:45
  • Fantastic post. This is a very elegant trick. Can you use the `rolling` object to create a rolling window in a similar way? I need a rolling window instead of a cumulative sum. – John Strong Jan 30 '19 at 05:13
  • yeah, how would you do the time step of say 3 instead of a cumulative sum? Also I am getting all 0's with the padded_vectors step when converting my input_vector – dasvootz Oct 20 '20 at 03:36
  • "The input sequences have to be the same length to run them through the model, so you need to pad them to be the max length of your cumulative vectors": This uses up a bunch of memory when the input data has e.g. 128 floating point features, 4000 timesteps, and 1 sample (i.e. `(1, 4000, 128)`). – There Apr 08 '22 at 21:02
  • It's been a long time since this thread was active, but some folks are commenting that the cumsum trick leads to very long vectors when you have thousands of samples. You don't need to make a vector the length of the entire dataset, just use a window. For a window of 4 periods long use something like ```df['cumulative_input_vectors'] = [df.loc[max(0,idx-3):idx,].single_input_vector.sum() for idx, row in df.iterrows()]``` – HaplessEcologist Nov 10 '22 at 23:06

Tensor shape

You're right that Keras is expecting a 3D tensor for an LSTM neural network, but I think the piece you are missing is that Keras expects that each observation can have multiple dimensions.

For example, in Keras I have used word vectors to represent documents for natural language processing. Each word in the document is represented by an n-dimensional numerical vector (so if n = 2 the word 'cat' would be represented by something like [0.31, 0.65]). To represent a single document, the word vectors are lined up in sequence (e.g. 'The cat sat.' = [[0.12, 0.99], [0.31, 0.65], [0.94, 0.04]]). A document would be a single sample in a Keras LSTM.

This is analogous to your time series observations. A document is like a time series, and a word is like a single observation in your time series; the only difference is that in your case each observation is represented by just n = 1 dimension.

Because of that, I think your tensor should be something like [[[a1], [a2], ..., [aT]], [[b1], [b2], ..., [bT]], ..., [[x1], [x2], ..., [xT]]], where the number of series (a, b, ..., x) corresponds to nb_samples, timesteps = T, and input_dim = 1, because each of your observations is only one number.
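
A minimal sketch of building such a tensor with numpy (the series here are made up):

import numpy as np

T = 5                               # made-up length
a = np.arange(T, dtype=float)       # one series of T observations
b = np.arange(T, dtype=float) + 10  # another series

# Stack the series as samples, then add the trailing input_dim = 1 axis
X = np.stack([a, b]).reshape(2, T, 1)
print(X.shape)                      # (2, 5, 1) -> (nb_samples, timesteps, input_dim)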

Batch size

Batch size should be set to maximize throughput without exceeding the memory capacity of your machine, per this Cross Validated post. As far as I know, your number of samples does not need to be a multiple of your batch size, either when training the model or when making predictions from it.

Examples

If you're looking for sample code, on the Keras Github there are a number of examples using LSTM and other network types that have sequenced input.

Andrew
  • Thanks for the answer. I find it hard to actually get the data in the shape that you describe, starting from a Pandas dataframe. And as for the batch size, I believe that Keras does require it to be a multiple of `nb_samples`, as I have seen it throw error messages about this, which makes things a lot harder. I have seen the examples that you link to before, but they are not really for timeseries and multiple inputs, and that really does make it a lot harder (you'll see it when you try it). Would you mind giving me an example, even if it's a basic one using e.g. the SKLearn Boston dataset? – dreamer Oct 07 '16 at 20:34
  • Does the Boston dataset contain time series data? – Andrew Oct 07 '16 at 20:44
  • Well I'm not sure if it's really time series data, but it's not really important, as you can just treat it as such, i.e. you act as if the next number corresponds to the next datapoint (I'm not interested in keeping track of a column containing the timestamp anyway, so it doesn't matter that the dataset doesn't contain a time column), and split it into an in- and out-of-sample set to do predictions. – dreamer Oct 07 '16 at 20:47
  • Based on your initial question it sounds like the input would be one single number, right? So you want to just pick one column as input and another as output and then train the model? – Andrew Oct 07 '16 at 20:57
  • No, I think we misunderstood each other. My input is n columns of T observations each (n time series). Each observation is a number; I thought that is what you meant to ask me in the comments section. To be explicit: output: y = (y1, y2, ..., yT), input: x = (x11, x12, ..., x1T; x21, x22, ..., x2T; ...; xn1, xn2, ..., xnT) (a matrix of n input vectors of length T each) – dreamer Oct 07 '16 at 21:12
  • Are you trying to predict the output at each time step (e.g. if you have one set of observations every hour for 9 hours, do you want the predicted output for each hour given the data for all previous hours as input, or just 1 output after 9 hours)? – Andrew Oct 10 '16 at 16:34
  • I'm trying to predict the output at each time step (so in your example e.g. you fit the model using the data for 7 hours, and you predict each time step of the last two hours using the model fitted using the previous 7 hours). – dreamer Oct 12 '16 at 09:12
  • Ok I added another answer that I think gives you what you need. – Andrew Oct 12 '16 at 18:26
  • How do you create an embedding layer for a nested array and carry out lstm classification for each sentence? A document is made up of multiple sentences, and each sentence in turn multiple tokens: doc = [[sent1tok1, sent1tok2],[sent2tok1, sent2tok2]] and each token is a list with say 50 dims. How can you perform LSTM classification for each sentence, with the sequential document sequence though? – dter Jun 12 '18 at 00:59