6

I am looking at panel data, which is structured like this:

D = \left\{ (x^{(k)}_{t}, y^{(k)}_{t}) \,|\, t = t_0, \dots, t_k \right\}_{k=1}^{N}

where x^{(k)} denotes the k'th sequence, x^{(k)}_{t} denotes the k'th sequence's value at time t, and x^{(k)}_{i,t} is the i'th entry of the vector x^{(k)}_{t}. That is, x^{(k)}_{t} is the feature vector of the k'th sequence at time t. The sub- and superscripts mean the same for the label data y^{(k)}_{t}, except that here y^{(k)}_{t} \in \{0,1\}.

In plain words: the data set contains individuals observed over time, and for each time point at which an individual is observed, it is recorded whether they bought an item or not (y \in \{0,1\}).

I would like to use a recurrent neural network with LSTM units from Keras to predict whether a person will buy an item or not at a given time point. I have only been able to find examples of RNNs where each sequence has a single label (philipperemy link), not examples where each sequence element has its own label, as in the problem described above.

My approach so far has been to create a tensor with dimensions (samples, timesteps, features), but I cannot figure out how to format the labels so that Keras can match them with the features. It should be something like (samples, timesteps, 1), where the last dimension holds the single label value of 0 or 1; see the sketch below.
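For concreteness, here is a rough sketch of how I picture the two arrays (the sizes are made up, e.g. 116 features; only the shapes matter):

import numpy as np

# made-up sizes: 1000 individuals, 50 time steps, 116 features per time step
n_samples, n_timesteps, n_features = 1000, 50, 116

# feature tensor: X[k, t, i] corresponds to x^(k)_{i,t}
X = np.zeros((n_samples, n_timesteps, n_features))

# label tensor: y[k, t, 0] corresponds to y^(k)_t, one 0/1 value per time step
y = np.zeros((n_samples, n_timesteps, 1))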

Furthermore, some of the approaches I have come across split each sequence so that its subsequences are added to the training data, which increases the memory requirement tremendously (mlmastery link). This is infeasible in my case: I have multiple GBs of data and would not be able to hold it all in memory if I added subsequences.

The model I would like to use is something like this:

from keras.models import Sequential
from keras.layers import LSTM, Dense

mod = Sequential()
mod.add(LSTM(30, input_dim=116, return_sequences=True))
mod.add(LSTM(10))
mod.add(Dense(2))

Does anyone have experience working with panel data in keras?

i.n.n.m
Math_kv
  • Math mode doesn't seem to work; I followed this tutorial: http://meta.math.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference – Math_kv Mar 09 '17 at 11:57
  • I am wondering if you are still on stackoverflow and if you would mind posting your data and full model? I am trying to learn keras for panel and my data is similar to yours, but there is not much out there for panel keras examples. – John Stud Feb 02 '19 at 02:44
  • Hi John, unfortunately I don't have access to the data or the model anymore. – Math_kv Feb 07 '19 at 12:32

2 Answers

5

Try:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

mod = Sequential()
mod.add(LSTM(30, input_shape=(timesteps, features), return_sequences=True))
mod.add(LSTM(10, return_sequences=True))
mod.add(TimeDistributed(Dense(1, activation='sigmoid')))
# In the newest Keras version you can change the line above to mod.add(Dense(1, ...))

mod.compile(loss='binary_crossentropy', optimizer='rmsprop')
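As a quick shape check, something like the following should fit the model, assuming timesteps and features were set before building it, and using random placeholder data of the shapes described in the question:

import numpy as np

X = np.random.random((32, timesteps, features))       # 32 placeholder sequences
y = np.random.randint(0, 2, size=(32, timesteps, 1))  # one 0/1 label per time step

mod.fit(X, y, epochs=10, batch_size=8)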
Marcin Możejko
  • Does it matter what batch size you use for panel data? Can the batch size be more than 1 individual? – gannawag Aug 07 '17 at 14:44
0

It looks like the only option is to run the LSTM for each individual (i.e. each sequence) separately when the panel is unbalanced, meaning the sequences have different lengths, which I assume is the case since the end time t_k depends on k in your question.
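A minimal sketch of that per-individual loop, assuming the variable-length sequences are kept in plain Python lists X_list / y_list (placeholder names) and the model was built with input_shape=(None, features) and per-timestep outputs as in the other answer, so the time dimension is not fixed:

import numpy as np

# X_list[k] has shape (t_k, features), y_list[k] has shape (t_k, 1)
for X_k, y_k in zip(X_list, y_list):
    # feed one individual at a time as a batch of size 1
    mod.train_on_batch(X_k[np.newaxis, ...], y_k[np.newaxis, ...])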

Can