
I trained two neural networks with Keras: an MLP and a Bidirectional LSTM.

My task is to predict the word order in a sentence, so for each word the neural network has to output a real number. When a sentence with N words is processed, the N real numbers in the output are ranked to obtain integers representing the word positions.
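Concretely, the ranking step looks roughly like this (a minimal sketch with made-up scores; my real post-processing may differ in details):

import numpy as np

# The network outputs one real number per word; the integer positions are the
# ranks of those scores (argsort of argsort).
scores = np.array([0.7, -1.2, 0.3, 2.1])    # hypothetical outputs for a 4-word sentence
positions = np.argsort(np.argsort(scores))  # rank of each word's score
print(positions)                            # [2 0 1 3]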

I'm using the same dataset and the same preprocessing for both models. The only difference is that for the LSTM I padded the sequences so they all have the same length.

In the prediction phase with the LSTM, I exclude the predictions produced for the padding vectors, since I masked them in the training phase.
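Roughly what I do at prediction time (a sketch; `padded_sentence` and the all-zero padding check are illustrative, not my exact code):

import numpy as np

# Predict a score per timestep, then keep only the timesteps that are real words
# (padded rows are all zeros, matching the mask_value used in training).
preds = model.predict(padded_sentence[np.newaxis, ...])[0, :, 0]  # shape: (timesteps,)
real_steps = np.any(padded_sentence != 0., axis=-1)               # True where not padding
preds = preds[real_steps]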

MLP architecture:

from tensorflow import keras

mlp = keras.models.Sequential()

# add input layer
mlp.add(
    keras.layers.Dense(
        units=training_dataset.shape[1],
        input_shape=(training_dataset.shape[1],),
        kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
        activation='relu')
    )

# add hidden layer (input_shape is only needed on the first layer, so it is omitted here)
mlp.add(
    keras.layers.Dense(
        units=training_dataset.shape[1] + 10,
        kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
        bias_initializer='zeros',
        activation='relu')
    )

# add output layer: one real number per word vector
mlp.add(
    keras.layers.Dense(
        units=1,
        kernel_initializer=keras.initializers.RandomUniform(minval=-0.05, maxval=0.05, seed=None),
        bias_initializer='zeros',
        activation='linear')
    )

Bidirectional LSTM architecture:

import tensorflow as tf
from tensorflow.keras.layers import Masking, Bidirectional, LSTM, Dropout, Dense

model = tf.keras.Sequential()
# mask the all-zero padding vectors so they are ignored downstream
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(Bidirectional(LSTM(units=20, return_sequences=True)))
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))
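For reference, I compile and train along these lines (the optimizer, loss, and the variable names for the padded data are illustrative, not my exact setup):

model.compile(optimizer='adam', loss='mse')
model.fit(
    padded_training_dataset,  # (samples, timesteps, features) - illustrative name
    padded_targets,           # (samples, timesteps, 1) - illustrative name
    epochs=50,
    batch_size=32,
    validation_split=0.1)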

I expected the task to be better suited to an LSTM, since it should capture dependencies between words well.

With the MLP I achieve good results, but with the LSTM the results are very bad.

Since I'm a beginner, could someone help me understand what is wrong with my LSTM architecture? I'm going out of my head.

Thanks in advance.

pairon

1 Answer


For this problem, I am actually not surprised that MLP performs better.

The architecture of LSTM, bi-directional or not, assumes that location is very important to the structure. Words next to each other are more likely to be related than words farther away.

But in your problem you have removed that locality and are trying to restore it. For that, an MLP with global information can do a better job at the sorting.

That said, I think there is still something to be done to improve the LSTM model.


One thing you can do is ensure that the complexity of each model is similar. You can check this easily with `count_params`.

mlp.count_params()
model.count_params()

If I had to guess, your LSTM is much smaller. There are only 20 units, which seems small for an NLP problem. I used 512 for a Product Classification problem to process character-level information (vocabulary of size 128, embedding of size 50). Word-level models trained on bigger data sets, like AWD-LSTM, get into the thousands of units.

So you probably want to increase that number. You can get an apples-to-apples comparison between the two models by increasing the number of units in the LSTM until the parameter counts are similar. But you don't have to stop there: you can keep increasing the size until you start to overfit or training starts taking too long.
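For example, here is a rough sketch of how you could search for a comparable size (assuming the same `timesteps` and `features` as your model; the candidate sizes are arbitrary):

import tensorflow as tf
from tensorflow.keras.layers import Masking, Bidirectional, LSTM, Dropout, Dense

# Rebuild the LSTM model at a few candidate sizes and compare parameter counts
# against the MLP until they are in the same ballpark.
target = mlp.count_params()
for units in (20, 50, 100, 200, 400, 800):
    candidate = tf.keras.Sequential([
        Masking(mask_value=0., input_shape=(timesteps, features)),
        Bidirectional(LSTM(units=units, return_sequences=True)),
        Dropout(0.2),
        Dense(1, activation='linear'),
    ])
    print(units, candidate.count_params(), target)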

mcskinner
  • Yes, I embedded the words. If I remember correctly, the embedding size is around 300. In total, each word has 837 features (embeddings + one-hot). So, should I set units equal to the number of features? However, my LSTM has 137,321 parameters, while the MLP has 1,400,000. You are right, my model is too simple. – pairon Apr 18 '20 at 21:14
  • Thank you for the update! So yes, I would try using 10x more hidden units, for 200. Also see my update: an MLP may actually be better for this problem. But at least now it will be a fair fight! – mcskinner Apr 18 '20 at 21:17
  • Thank you, I'll try. However, another thing I noticed is that the `shuffle` parameter defaults to `True`. Do you think I should set it to `False`? – pairon Apr 18 '20 at 21:20
  • I would leave it at the default of `True`. This is best practice. – mcskinner Apr 18 '20 at 21:38
  • Hmm, after increasing the `units`, the loss is `nan`. I'm sure there are no missing values in the dataset. – pairon Apr 18 '20 at 21:42
  • Can you remove the `Dropout` layer as well? There is no corresponding layer in the MLP. You may also need to tune your learning rate: a `nan` loss often means the gradient exploded, and one common cause is a learning rate high enough to make training diverge (see the sketch after these comments). – mcskinner Apr 18 '20 at 21:44
  • Yes. Maybe the problem was related to weights initialization. – pairon Apr 18 '20 at 21:45
  • I have another question (sorry, I'm a beginner): despite the few `units` in the `LSTM`, its loss was significantly lower than the `MLP`'s during training. Why this behavior? – pairon Apr 19 '20 at 12:06
  • Unfortunately, the predictions are still bad. The `units` was 436; now I'll try `units=837` directly, which is the number of features. – pairon Apr 19 '20 at 15:18
  • Another thing: I had also trained the LSTM with `timesteps = 1`, so each word was processed without knowing the other words. The results were better than the MLP's: is the reason related to the locality you mentioned in your answer? – pairon Apr 19 '20 at 15:50
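
A minimal sketch of the learning-rate adjustment suggested in the comments above (the learning rate and loss here are placeholders, not values from the thread):

from tensorflow.keras.optimizers import Adam

# Recompile with a smaller learning rate so training is less likely to diverge to nan
model.compile(optimizer=Adam(learning_rate=1e-4), loss='mse')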