
The limitation mentioned in the question occurred in a Keras context.

I've read numerous posts about how to have variable-length sequences in batches (and I understand the replies to those posts). However, the only post I've found about why is here on Data Science, with the answer being "Within a single batch, you must have the same number of timesteps since it must be a tensor (this is typically where you see 0-padding)."
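For concreteness, here is a minimal sketch of the 0-padding that answer describes, assuming Keras's `pad_sequences` utility (one of several ways to do it): the ragged list of sequences is made rectangular by padding every sequence up to the longest one in the batch.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# A ragged "batch": three sequences of token ids with different lengths.
batch = [[1, 2, 3],
         [4, 5],
         [6, 7, 8, 9]]

# Right-pad with 0 up to the longest sequence (4 timesteps here),
# producing a rectangular array that can be fed to an RNN as one tensor.
padded = pad_sequences(batch, padding="post", value=0)
print(padded)
# [[1 2 3 0]
#  [4 5 0 0]
#  [6 7 8 9]]
```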

However, this seems to be an unnecessary restriction (I am not very familiar with Keras/TensorFlow, so I am asking from a perspective not specific to any API).

Within training batches, why can the data entries (I brought up the example of sentences) not have variable lengths (in my example, that would be the number of words)? Since variable-length sequences are an application of RNNs, this question boils down to: why can there not be a variable number of time steps in an RNN during training, given a batch?

Here are the reasons that made me question the lack of support for variable-length sequences in batches:

1) Every data entry, regardless of how big a batch it is part of, has gradients of the RNN's parameters associated with it. Batch size only affects when you actually change the parameters of the network based on those computed gradients (the average is taken and then applied based on other hyperparameters). Variable-length sequences will have a variable number of time steps, but the gradient associated with each entry already averages a parameter's influence across that entry's own time steps (and an average is possible for any number of time steps). Hence, regardless of the number of time steps, the gradient can be computed per entry, and therefore for the entire batch, by averaging the gradients of all entries (see the sketch after this list).

2) Parallelism of matrix multiplication is still possible as normal given a batch with variable-length sequences, because matrix multiplication is parallelized for each entry in the batch, and each entry is one sequence of fixed length.
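A minimal sketch of the idea in point 1, assuming TensorFlow 2-style eager execution with `tf.GradientTape`; the model and sizes are illustrative. A gradient is computed per entry, each entry keeping its own number of timesteps, and the per-entry gradients are averaged before a single parameter update:

```python
import tensorflow as tf

# Toy model: an RNN over 8-dimensional inputs followed by a scalar output.
# input_shape=(None, 8) leaves the number of timesteps unspecified.
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(16, input_shape=(None, 8)),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

# A "batch" of three sequences with different numbers of timesteps.
sequences = [tf.random.normal((5, 8)), tf.random.normal((3, 8)), tf.random.normal((7, 8))]
targets = [tf.constant([[1.0]]), tf.constant([[0.0]]), tf.constant([[1.0]])]

# Gradient per entry (each entry is a 1-element batch of its own length)...
grads_per_entry = []
for x, y in zip(sequences, targets):
    with tf.GradientTape() as tape:
        pred = model(tf.expand_dims(x, 0))  # shape (1, timesteps, 8)
        loss = loss_fn(y, pred)
    grads_per_entry.append(tape.gradient(loss, model.trainable_variables))

# ...then one update from the average gradient over the whole "batch".
avg_grads = [tf.reduce_mean(tf.stack(g), axis=0) for g in zip(*grads_per_entry)]
optimizer.apply_gradients(zip(avg_grads, model.trainable_variables))
```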

Mario Ishac
  • Because TensorFlow (or any other framework) works on tensors/arrays and those can't be "ragged" like [[1,2,3],[4,5]]. They need to be rectangular. There are some very smart people behind these frameworks and I'm sure that if it was simple to allow for ragged arrays without significant sacrifices in efficiency they would have done so. – xdurch0 Aug 26 '18 at 07:29
  • @xdurch0 do you have any broad reasons or key terms (that I can perform research on/look up) as to why there are sacrifices in efficiency? – Mario Ishac Aug 26 '18 at 17:26
  • Afraid not, I'm just a TF user myself and don't know much about the nitty-gritty details. I did actually perform a bit of a Google search but couldn't find anything yet... I suppose it is related to very low-level C or CUDA implementation issues. Maybe it would be worth rephrasing the question in more elementary terms (i.e. about array shapes, not RNNs etc.) and asking in the respective communities -- maybe we could get some of those "very smart people" on the case. – xdurch0 Aug 27 '18 at 17:45
  • FWIW, Keras actually does admit masked variable length sequences as valid input to an RNN. This thread may help you: https://github.com/keras-team/keras/issues/4624 – Pranav Vempati Sep 07 '18 at 01:33
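A minimal sketch of the masking approach mentioned in the comment above, assuming 0-padded input and Keras's `Masking` layer, which tells downstream RNN layers to skip timesteps whose features equal the pad value:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Timesteps whose feature vector is all zeros (the pad value) are masked out,
    # so the LSTM effectively sees each sequence at its original length.
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, 8)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1),
])
```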

1 Answer


This post walks us through the RNN implementation. It's definitely the case that if we only have one batch we can take the max-length sequence and pad the rest of the sequences. And you can have variable-length sequences by using a batch size of 1 :) But if you want to use batches, and take advantage of the fact that matrix operations on a batch aren't much slower than processing one training example at a time, then you need matrices, not ragged tensors (this might be because that is what the majority of matrix libraries expect). So if you want variable-length sequences within a batch you'll have to pad the shorter sequences and truncate the longer ones. Doing this dynamically per batch seems like it would just hide more details and make any training issues more difficult to understand, without any benefit.
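To make "doing this dynamically per batch" concrete, here is a minimal sketch of a generator that pads each batch only to that batch's own longest sequence, assuming integer token ids and 0 as the pad value (the names are illustrative):

```python
import numpy as np

def batch_generator(sequences, labels, batch_size):
    """Yield (padded_batch, batch_labels), padding each batch to its own max length."""
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in chunk)
        padded = np.zeros((len(chunk), max_len), dtype=np.int32)
        for i, seq in enumerate(chunk):
            padded[i, :len(seq)] = seq  # right-pad with zeros
        yield padded, np.asarray(labels[start:start + batch_size])
```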

The question is what benefit you expect to get from dynamic-length sequences. It seems unreasonable to expect an efficiency speedup from this; I'm confident they would have implemented it that way if it were faster. Does it make the library easier to use? Then maybe the answer is that the padding is a dangerous detail to hide from the user, or that not enough users have requested it. Do you expect it to train better? This is plausible to me but probably doesn't give enough benefit to be worth the implementation complexity. You'll probably need to preprocess your data either way to decide how to handle very long sequences.

In summary, I'd be interested in seeing a comparison between training with batch sizes of one, so that the gradients are specific each time to the length of that training example, and the current batched approach. This will be much slower, but if the model trains better then maybe it's worth investigating further.

emschorsch
  • The benefit I wish to get from dynamic-length sequences is the ability to pass in new sequences (outside of training) longer than the max length (I don't see why Keras imposes this limitation, as normal RNNs should be able to handle this). Even if longer sequences would generally become more inaccurate the further they go past `MAX_LENGTH`, there isn't a limitation of RNNs that prohibits them (see the sketch after these comments). – Mario Ishac Sep 08 '18 at 04:51
  • At inference time or at training time? – emschorsch Sep 08 '18 at 04:53
  • Inference time, after I have confirmed my model is good and have pushed it to production for others to use (and thus submit their own data, which I would like to allow to be longer than the data I trained the network on). – Mario Ishac Sep 08 '18 at 04:56
  • Ah, I misunderstood the question. Because you were talking about batches I assumed you cared most about training. – emschorsch Sep 08 '18 at 05:20
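Regarding the comments above: a Keras RNN whose time dimension is left unspecified will accept sequences of any length at inference time, so the `MAX_LENGTH` restriction typically comes from how the training data was padded rather than from the RNN itself. A minimal sketch, assuming an embedding-plus-LSTM model; the sizes are illustrative:

```python
import tensorflow as tf

# No fixed sequence length anywhere in the model definition,
# so it can score inputs longer than anything seen during training.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=32),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

short = tf.constant([[1, 2, 3]])                  # 3 timesteps
long = tf.constant([[4, 5, 6, 7, 8, 9, 10, 11]])  # 8 timesteps
print(model(short).shape, model(long).shape)      # both (1, 1)
```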