I'm trying to implement a neural network whose input is a 2D grid, with the musical note (pitch) on one axis and the octave of that note on the other axis.
The input is supposed to go through a convolution layer (Conv2DLayer). After convolution, the outputs should be fed into an LSTM layer.
Input -> Convolution and pooling layers -> LSTM layers -> Output
The problem is that LSTM layers and convolution layers each expect a specific input shape:
Conv2DLayer expected input shape: (batch_size, num_channels, rows, columns)
LSTMLayer expected input shape: (batch_size, sequence_len, num_inputs)
How can I take an input of shape (batch_size, sequence_len, num_channels, rows, columns), or something similar, and build such a network? If I reshape the input to get rid of sequence_len by flattening it into rows or columns, then one of those dimensions would have to change and the 2D note/octave layout would be distorted.
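To make the shape question concrete, here is a minimal NumPy sketch (not Lasagne code) of one possible reshape: folding sequence_len into the batch axis instead of into rows or columns. It assumes the convolution can treat every time step as an independent sample. The sizes and the max_pool_2x2 helper are made up for illustration and only stand in for the real convolution and pooling layers.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
batch_size, sequence_len, num_channels, rows, columns = 2, 5, 1, 12, 8

# Dummy 5D input: (batch_size, sequence_len, num_channels, rows, columns)
x = np.arange(
    batch_size * sequence_len * num_channels * rows * columns,
    dtype=np.float32,
).reshape(batch_size, sequence_len, num_channels, rows, columns)

# Fold the time axis into the batch axis. Each time step becomes its own
# "sample", so rows and columns are left untouched.
conv_in = x.reshape(batch_size * sequence_len, num_channels, rows, columns)


def max_pool_2x2(a):
    """Stand-in for the conv/pool stack: non-overlapping 2x2 max pooling."""
    n, c, h, w = a.shape
    return a.reshape(n, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))


conv_out = max_pool_2x2(conv_in)

# Unfold the time axis again and flatten the feature maps of each time step
# into one vector: (batch_size, sequence_len, num_inputs)
lstm_in = conv_out.reshape(batch_size, sequence_len, -1)

print(conv_in.shape)   # shape seen by the convolution stage
print(conv_out.shape)  # shape after the stand-in pooling
print(lstm_in.shape)   # shape handed to the recurrent stage
```

In Lasagne the same two reshapes could presumably be done with a ReshapeLayer before and after the convolution stack, but I have not verified that this trains correctly end to end.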