My problem is to predict, at each time step of a time series, the probability that the point is an error. The data has shape (n_samples, timesteps, features), where timesteps is the maximum length of the time series. The training y_train contains a one-hot label for each time point indicating whether it is an error or not.
X_train and y_train are padded with zeros, so a Masking layer is added.
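For reference, here is a minimal NumPy sketch of how the zero-padded X_train and one-hot y_train look; the dimensions and sequence contents are hypothetical toy values, not taken from the actual data:

```python
import numpy as np

# Hypothetical toy dimensions for illustration only.
max_len, n_features = 5, 3

# Two sequences of different lengths, with per-timestep labels
# (1 = error at that time step, 0 = no error).
seqs = [np.ones((2, n_features)), np.ones((4, n_features))]
labels = [[0, 1], [0, 0, 1, 1]]

# Zero-pad everything to max_len so the samples stack into one array.
X_train = np.zeros((len(seqs), max_len, n_features))
y_train = np.zeros((len(seqs), max_len, 2))
for i, (s, l) in enumerate(zip(seqs, labels)):
    X_train[i, :len(s)] = s
    y_train[i, np.arange(len(l)), l] = 1  # one-hot per time step

print(X_train.shape)  # (2, 5, 3)
print(y_train.shape)  # (2, 5, 2)
```

Note that the padded time steps are all-zero in both arrays, which is exactly what the Masking layer with mask_value = 0 relies on to skip them.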
To predict the per-timestep error probability, my implementation is as below:
from keras.models import Sequential
from keras.layers import (Masking, Bidirectional, LSTM, Dropout,
                          TimeDistributed, Dense, Activation)

model = Sequential()
model.add(Masking(mask_value=0,
                  input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Bidirectional(LSTM(para['hiddenStateSize'],
                             return_sequences=True)))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(2)))
model.add(TimeDistributed(Activation('softmax')))
# The labels are one-hot over two classes, so categorical_crossentropy
# matches the softmax output; binary_crossentropy would treat the two
# output units as independent sigmoid-style targets.
model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model.summary())
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
          shuffle=False)
The question is: the first time steps in each sample are always over-predicted. Is there a better implementation for this problem?