
I'm facing the following issue. I have a large number of documents that I want to encode using a bidirectional LSTM. Each document has a different number of words, and each word can be thought of as a timestep.

When configuring the bidirectional LSTM, we are expected to provide the timeseries length. When I am training the model, this value will be different for each batch. Should I choose a number for timeseries_size that is the biggest document size I will allow? Will any documents bigger than this simply not be encoded?

Example config:

from keras.layers import Bidirectional, LSTM
Bidirectional(LSTM(128, return_sequences=True), input_shape=(timeseries_size, encoding_size))
Funzo

2 Answers


This is a well-known problem and it concerns both ordinary and bidirectional RNNs. This discussion on GitHub might help you. In essence, here are the most common options:

  • A simple solution is to set timeseries_size to the maximum length over the training set and pad the shorter sequences with zeros (example Keras code; a minimal padding sketch also follows this list). An obvious downside is memory waste if the training set happens to contain both very long and very short inputs.

  • Separate input samples into buckets of different lengths, e.g. a bucket for length <= 16, another for length <= 32, etc. Basically this means training several separate LSTMs for different sets of sentences. This approach (known as bucketing) requires more effort, but it is currently considered the most efficient and is actually used in the state-of-the-art translation engine Tensorflow Neural Machine Translation (a bucketing sketch follows the padding example below).
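
A minimal sketch of the padding approach, assuming Keras, pre-encoded word vectors, and a document-level binary label; the toy data, layer sizes, and loss are placeholders of mine, not part of the original answer:

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

encoding_size = 50                                   # assumed word-vector size
# toy stand-in for pre-encoded documents of varying lengths
docs = [np.random.rand(n, encoding_size) for n in (5, 12, 8)]
labels = np.array([[0], [1], [1]], dtype='float32')  # assumed document-level labels

timeseries_size = max(len(d) for d in docs)          # pad up to the longest document
x = pad_sequences(docs, maxlen=timeseries_size, dtype='float32', padding='post')

model = Sequential()
model.add(Bidirectional(LSTM(128), input_shape=(timeseries_size, encoding_size)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x, labels, batch_size=2, epochs=1)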
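
And here is a minimal sketch of the bucket assignment itself. This is my own illustration of the idea, not code from Tensorflow NMT; the boundaries and toy documents are assumptions:

import numpy as np
from keras.preprocessing.sequence import pad_sequences

encoding_size = 50
docs = [np.random.rand(n, encoding_size) for n in (3, 9, 14, 20, 31, 60)]

boundaries = [16, 32, 64]                            # assumed bucket boundaries
buckets = {b: [] for b in boundaries}
for doc in docs:
    # place each document in the smallest bucket that fits it
    # (documents longer than the largest boundary would need truncation or a bigger bucket)
    boundary = next(b for b in boundaries if len(doc) <= b)
    buckets[boundary].append(doc)

# pad each bucket only up to its own boundary; far less wasted zero-padding
# than a single global maxlen, and each bucket can feed its own LSTM
padded = {b: pad_sequences(seqs, maxlen=b, dtype='float32', padding='post')
          for b, seqs in buckets.items() if seqs}
for b, arr in padded.items():
    print(b, arr.shape)                              # e.g. 16 -> (3, 16, 50)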

Maxim

One option is to configure the model with a variable length: input_shape=(None, encoding_size).
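
A minimal sketch of such a variable-length model, assuming Keras and a document-level binary output; the layer sizes, loss, and encoding_size value are placeholders, and the model pairs with the training loop shown below:

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

encoding_size = 50                                   # assumed word-vector size
model = Sequential()
# None in the time dimension lets each batch have its own length
model.add(Bidirectional(LSTM(128), input_shape=(None, encoding_size)))
model.add(Dense(1, activation='sigmoid'))            # assumed document-level output
model.compile(optimizer='adam', loss='binary_crossentropy')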

You will then have to train each document individually.

for epoch in range(epochs):
    for document, output in zip(list_of_documents, list_of_outputs):
        # make sure you have a batch dimension, even if its size is 1
        # (reshape returns a new array, so assign the result)
        document = document.reshape((1, len(document), encoding_size))
        output = output.reshape((1,) + output.shape)

        model.train_on_batch(document, output)

Another option is to use a single input array padded to the maximum length and add a Masking layer. (If you're using an Embedding layer, it's as easy as setting the parameter mask_zero=True.)

This works pretty well for one-directional LSTMs, but I'm not sure it's correctly implemented for bidirectional ones (I never tested it).
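
Here is a minimal sketch of the masking approach for pre-encoded word vectors, assuming Keras; the sizes, toy data, and loss are my own placeholders, and per the caveat above you should verify the bidirectional behaviour yourself:

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Masking, Bidirectional, LSTM, Dense

encoding_size, max_len = 50, 100                     # assumed sizes
docs = [np.random.rand(n, encoding_size) for n in (5, 12, 8)]
labels = np.array([[0], [1], [1]], dtype='float32')

# pad with zeros up to max_len; Masking tells the LSTM to skip all-zero timesteps
x = pad_sequences(docs, maxlen=max_len, dtype='float32', padding='post')

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(max_len, encoding_size)))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x, labels, batch_size=2, epochs=1)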

Daniel Möller