[keras]Many to many probability prediction using LSTM

Question

My problem is to predict the error probability of a time series data. In the data, we have (n_samples, timesteps, features), where timesteps are the maximum length of the time series. The training y_train has one_hot labels of each time point being an error or not.

X_train and y_train are padded with zeros, so a masking layer is added.

In order to predict the error probability, I have an implementation as below:

model = Sequential()
model.add(Masking(mask_value = 0, input_shape = (X_train.shape[1], 
X_train.shape[2])))
model.add(Bidirectional(LSTM(para['hiddenStateSize'], 
return_sequences = True)))

model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(2)))
model.add(TimeDistributed(Activation('softmax')))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam')
print(model.summary())

model.fit(X_train, y_train, epochs = epochs, batch_size = batch_size, shuffle = False)

The question is: The first data points in the samples are always over-predicted. Is there a better implementation of the problem?

LSTM's do take a few steps to start understanding how the sequence works. You could maybe use longer sequences and discard the edges? (Since you're using `Bidirectional`, this is probably happening on the other end too - beware of the masking) . — Daniel Möller, Oct 25 '17 at 12:43
@MarcinMożejko, by over-predicted I mean the predicted probability of error is higher than it should be. This dataset is imbalanced though, with 99% of the data point being correct (not error). — Dynasty1010, Oct 25 '17 at 17:14
Can you test your data in order to check if a probability of having an error in a first step is significantly higher than in other steps? Maybe your network is learning a proper pattern? — Marcin Możejko, Oct 25 '17 at 20:55
@MarcinMożejko, thanks for your comments. The error actually got accumulated overtime, which means the first several time points should have less error than the latter ones. — Dynasty1010, Oct 25 '17 at 21:51

score 0 · Answer 1 · answered Oct 26 '17 at 00:55

0

I haven't experimented much with bidirectional lstms for time-series prediction but here are 2 things that I would change in your model:

I would either use categorical_crossentropy with Dense(2) and softmax:

model.add(TimeDistributed(Dense(2)))
model.add(TimeDistributed(Activation('softmax')))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')

Or binary_crossentropy with Dense(1) and sigmoid:

model.add(TimeDistributed(Dense(1)))
model.add(TimeDistributed(Activation('sigmoid')))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam')

Also try the dropout (or recurrent_dropout) parameter of the LSTM

model.add(Bidirectional(LSTM(64, dropout=0.2, return_sequences = True)))

answered Oct 26 '17 at 00:55

Julio Daniel Reyes

5,489
1
19
23

thanks@Julio. Any reason why categorical_crossentropy works on one_hot but binary_crossentropy works on label? – Dynasty1010 Oct 27 '17 at 00:02
For the formulation and difference let me refer you to [this question](https://datascience.stackexchange.com/questions/9302/the-cross-entropy-error-function-in-neural-networks) but basically both are log likelihood estimators, let's say categorical crossentropy is the general case that expects more than one output but binary crossentropy expects just one, and that output needs to be either 0 or 1. – Julio Daniel Reyes Oct 27 '17 at 01:27

[keras]Many to many probability prediction using LSTM

1 Answers1