
Hello, I am trying to build a seq2seq model to generate some music. I really don't know much about it, though. On the internet I have found this model:

from keras.models import Sequential
from keras.layers import (LSTM, BatchNormalization, Dropout,
                          RepeatVector, TimeDistributed, Dense)

# input_dim, num_units, y_seq_length and output_dim are globals
def createSeq2Seq():
    # seq2seq model

    # encoder
    model = Sequential()
    model.add(LSTM(input_shape=(None, input_dim), units=num_units,
                   activation='tanh', return_sequences=True))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))
    model.add(LSTM(num_units, activation='tanh'))  # returns only the last state

    # decoder
    model.add(RepeatVector(y_seq_length))  # copies that state y_seq_length times
    num_layers = 2
    for _ in range(num_layers):
        model.add(LSTM(num_units, activation='tanh', return_sequences=True))
        model.add(BatchNormalization())
        model.add(Dropout(0.3))

    model.add(TimeDistributed(Dense(output_dim, activation='softmax')))
    return model

My data is a list of piano rolls. A piano roll is a matrix whose columns are a one-hot encoding of the possible pitches (49 in my case) and whose rows each correspond to one time step (0.02 s in my case). The piano-roll matrix therefore contains only ones and zeros.
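For concreteness, a one-hot piano roll like the one described can be sketched in plain numpy (the helper name `to_pianoroll` and the toy pitch indices are my own, not from the question):

```python
import numpy as np

def to_pianoroll(pitch_indices, num_pitches=49):
    """Turn a sequence of pitch indices (one per time step) into a
    (time_steps, num_pitches) one-hot piano roll of ones and zeros."""
    roll = np.zeros((len(pitch_indices), num_pitches), dtype=np.float32)
    roll[np.arange(len(pitch_indices)), pitch_indices] = 1.0
    return roll

# Hypothetical 4-step melody: pitches 0, 12, 24, 12
roll = to_pianoroll([0, 12, 24, 12])
print(roll.shape)        # (4, 49)
print(roll.sum(axis=1))  # exactly one active pitch per row
```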

I have prepared my training data by reshaping my piano-roll songs (concatenating them one after the other) into shape = (something, batch_size, 49). So my input data is all the songs one after the other, split into blocks of size batch_size. My target data is then the same sequence, but delayed by one block.

x_seq_length and y_seq_length are equal to batch_size, and input_dim = 49.

My input and output sequences have the same dimension.
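As a sanity check of the blocking-plus-shift idea, here is a minimal numpy sketch (the helper name `make_blocks`, the block length of 4, and shifting by a single time step rather than a whole block are my own choices for illustration):

```python
import numpy as np

def make_blocks(song, block_len):
    """Cut a (time_steps, 49) piano roll into consecutive blocks of
    block_len steps; X is each block, Y is that block shifted one step ahead."""
    n = (len(song) - 1) // block_len  # drop the incomplete tail
    X = np.stack([song[i * block_len:(i + 1) * block_len] for i in range(n)])
    Y = np.stack([song[i * block_len + 1:(i + 1) * block_len + 1] for i in range(n)])
    return X, Y

song = np.eye(49, dtype=np.float32)[np.arange(10) % 49]  # 10-step toy roll
X, Y = make_blocks(song, block_len=4)
print(X.shape, Y.shape)  # (2, 4, 49) (2, 4, 49)
# Y is X delayed by one time step:
print((X[0, 1:] == Y[0, :-1]).all())
```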

Have I made any mistake in my reasoning? Is the seq2seq model I've found correct? And what does RepeatVector do?

1 Answer


This is not a seq2seq model. RepeatVector takes the last state of the last encoder LSTM and makes one copy per output token. Then you feed these copies into a "decoder" LSTM, which thus has the same input in every time step.
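In other words, RepeatVector is just a tiling operation. Its effect can be reproduced in plain numpy (toy sizes and the example state vector are assumptions for illustration):

```python
import numpy as np

# Pretend this is the encoder's final LSTM state (num_units = 3)
encoder_state = np.array([0.1, -0.5, 0.9])

# RepeatVector(y_seq_length) makes one copy per output time step
y_seq_length = 4
decoder_input = np.tile(encoder_state, (y_seq_length, 1))
print(decoder_input.shape)  # (4, 3)
# Every decoder time step sees exactly the same vector:
print((decoder_input == encoder_state).all())
```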

A proper autoregressive decoder takes its own previous outputs as input, i.e., at training time, the input of the decoder is the same as its output, but shifted by one position. This also means that your model is missing an embedding layer for the decoder inputs.
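Concretely, with teacher forcing the decoder input is the target sequence shifted right by one position, with a start-of-sequence token in front. A minimal numpy sketch (the all-zeros start token and the toy 5-step target are my own conventions):

```python
import numpy as np

num_pitches = 49
# Hypothetical 5-step one-hot target sequence (pitches 3, 7, 7, 12, 3)
target = np.eye(num_pitches, dtype=np.float32)[[3, 7, 7, 12, 3]]

# Teacher forcing: decoder input = <start> token followed by target[:-1]
start_token = np.zeros((1, num_pitches), dtype=np.float32)
decoder_input = np.concatenate([start_token, target[:-1]], axis=0)

print(decoder_input.shape)                       # (5, 49)
print((decoder_input[1:] == target[:-1]).all())  # shifted by one position
```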

Jindřich