The limitation mentioned in the question came up for me in a Keras context.
I've read numerous posts about how to handle variable-length sequences in batches (and I understand the replies to those posts), but the only post I've found about why the restriction exists is here on Data Science, with the answer being "Within a single batch, you must have the same number of timesteps since it must be a tensor (this is typically where you see 0-padding)."
However, this seems to be an unnecessary restriction (I am not very familiar with Keras/TensorFlow, so I'm asking from a perspective not specific to any API).
Within training batches, why can the data entries (I brought up the example of sentences) not have variable lengths (in my example, that would be the number of words)? Since variable-length sequences are an application of RNNs, this question boils down to: why can there not be a variable number of time steps in an RNN during training, given a batch?
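To make the restriction concrete, here is roughly what the 0-padding workaround mentioned above looks like in Keras, as I understand it; the integer word IDs are made up for illustration:

```python
# Sketch of the usual workaround: pad every sentence in a batch to the
# longest one so the batch becomes a rectangular (fixed-timestep) tensor.
# The integer word IDs are invented for illustration.
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    [12, 5, 87],        # 3 words
    [4, 9],             # 2 words
    [31, 7, 7, 2, 56],  # 5 words
]

# Every sequence is zero-padded to length 5, purely so the three of them
# fit into a single (3, 5) tensor with a fixed number of timesteps.
batch = pad_sequences(sentences, padding="post")
print(batch.shape)  # (3, 5)
```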
Here are the reasons that made me question the lack of support for variable-length sequences in batches:
1) Every data entry, regardless of how big a batch it belongs to, has a gradient of the RNN's parameters associated with it. Batch size only determines when you actually update the network's parameters based on those computed gradients (their average is taken and then applied according to other hyperparameters). Variable-length sequences have a variable number of time steps, but the gradient associated with each entry already aggregates a parameter's influence across that entry's own timesteps (and that aggregation is possible for any number of time steps). So the gradient can be computed per entry regardless of length, and therefore for the whole batch, by averaging the gradients of all entries (I sketch this below).
2) Parallelized matrix multiplication is still possible as usual for a batch with variable-length sequences, because the matrix multiplications are parallelized within each entry of the batch, and each entry is a single sequence with its own fixed length (also sketched below).
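To illustrate point 1, here is a rough sketch of what I mean by computing the gradient per entry and then averaging; the tiny model, random data, and sequence lengths are placeholder assumptions, not a real training setup:

```python
# Sketch of point 1: gradients can be computed per entry even when entries
# have different numbers of timesteps, then averaged across the batch.
# The model, data, and shapes below are made up for illustration.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 4)),      # None = any number of timesteps
    tf.keras.layers.SimpleRNN(8),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()

# Three sequences of different lengths (timesteps x 4 features) and targets.
sequences = [tf.random.normal((t, 4)) for t in (3, 5, 2)]
targets = [tf.random.normal((1, 1)) for _ in sequences]

per_entry_grads = []
for x, y in zip(sequences, targets):
    with tf.GradientTape() as tape:
        pred = model(x[tf.newaxis, ...])  # a "batch" of one fixed-length entry
        loss = loss_fn(y, pred)
    per_entry_grads.append(tape.gradient(loss, model.trainable_variables))

# Average each parameter's gradient over the batch, exactly as with a
# same-length batch, then hand the result to any optimizer.
avg_grads = [tf.reduce_mean(tf.stack(g), axis=0)
             for g in zip(*per_entry_grads)]
```

The averaging step never needs the entries to have the same number of timesteps; only the packing of a batch into a single tensor does.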
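For point 2, here is a rough sketch showing that at any single timestep the cell performs the same fixed-shape matrix multiplications no matter how long the whole sequence is; the plain tanh cell and the shapes are my own assumptions for illustration:

```python
# Sketch of point 2: each timestep of a simple RNN cell does identical
# fixed-shape matrix multiplications; only the number of steps varies
# between sequences.
import numpy as np

features, hidden = 4, 8
W = np.random.randn(features, hidden)   # input-to-hidden weights
U = np.random.randn(hidden, hidden)     # hidden-to-hidden weights
b = np.zeros(hidden)

def run_sequence(x):
    """Unroll one sequence of shape (timesteps, features)."""
    h = np.zeros(hidden)
    for x_t in x:                        # one fixed-size matmul pair per step
        h = np.tanh(x_t @ W + h @ U + b)
    return h

# Sequences of different lengths reuse exactly the same per-step operations;
# only the loop count differs, so nothing about the math breaks per entry.
for timesteps in (3, 5, 2):
    print(run_sequence(np.random.randn(timesteps, features)).shape)  # (8,)
```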