I am learning about TensorFlow and seq2seq problems for machine translation. For this, I set myself the following task:
I created an Excel file containing random dates in different formats, for example:
- 05.09.2192
- martes, 07 de mayo de 2329
- Friday, 30 December, 2129
In my dataset, each format occurs 1000 times. These are my training (X) values. For one half of the dataset, my target (Y) values are always in this format:
- 05.09.2192
- 07.05.2329
- 30.12.2129
And for the other half, always in this format:
- Samstag, 12. Juni 2669
- Donnerstag, 1. April 2990
- Freitag, 10. November 2124
To enable the model to distinguish between these two target formats, an additional piece of context information (C) is given as text:
- Ausgeschrieben (written out)
- Datum (date)
So each row consists of an input date (X), a context string (C), and a target date (Y).
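For illustration, a few such rows, assembled from the example values above (the exact column layout of my Excel file may differ):

# Illustrative (X, C, Y) rows, built only from the example values shown above;
# the real file contains 34,000 such rows.
rows = [
    ("Friday, 30 December, 2129", "Datum",          "30.12.2129"),
    ("Friday, 30 December, 2129", "Ausgeschrieben", "Freitag, 30. Dezember 2129"),
    ("05.09.2192",                "Datum",          "05.09.2192"),
]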
My goal is to create a model that can "translate" any date format into the requested German date format, e.g. 05.09.2192.
The dataset contains 34,000 pairs.
To solve this, I use a character-based tokenizer to transform the text into integers:
tokenizer = keras.preprocessing.text.Tokenizer(filters='', char_level=True, oov_token="|")
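A minimal sketch of how I apply it (X_texts and Y_texts are placeholders for the raw string columns; the exact padding setup in my notebook may differ slightly):

# Build one shared character vocabulary over inputs and targets.
tokenizer.fit_on_texts(list(X_texts) + list(Y_texts))
vocab_size = len(tokenizer.word_index) + 1   # +1 because index 0 is reserved for padding

# Convert to integer sequences and pad to a common length (pre-padding by default).
X_enc = keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(X_texts))
Y_pad = keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(Y_texts))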
I use an LSTM encoder-decoder model, and I expect it to reach (near-)perfect accuracy, since the mapping between X and Y is deterministic and can in principle be learned perfectly.
However, I reach a maximum of only about 72% accuracy. Even worse, the accuracy only gets that high because the padding is predicted well: most of the Y values are fairly short and are therefore heavily padded, so 12.02.2001 becomes e.g. ||||||||||||||||||||12.02.2001. The model learns to generate the padding token well, but not the expected value.
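To illustrate how much of the reported accuracy can come from padding alone, a rough check (assuming Y_pad from the sketch above, with 0 as the padding index):

import numpy as np

# If most target positions are padding, a model that only learns to emit
# the padding token already gets a high token-level accuracy.
pad_fraction = np.mean(Y_pad == 0)
print(f"Fraction of padded positions in the targets: {pad_fraction:.2%}")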
This is the model structure I used in my latest test:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_addons as tfa
from tensorflow.keras.layers import Concatenate

# Encoder and decoder take integer (character-id) sequences; batch size is fixed to 32.
encoder_inputs = keras.layers.Input(batch_input_shape=[32, None], dtype=np.int32)
decoder_inputs = keras.layers.Input(batch_input_shape=[32, None], dtype=np.int32)

# Shared character embedding (currently with an embedding dimension of 1).
embeddings = keras.layers.Embedding(vocab_size, 1)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

# Dense stack with dropout on top of the encoder embeddings.
encoder_0 = keras.layers.Dense(128)(encoder_embeddings)
encoder_0d = keras.layers.Dropout(0.4)(encoder_0)
encoder_0_1 = keras.layers.Dense(256)(encoder_0d)
encoder_0_1d = keras.layers.Dropout(0.2)(encoder_0_1)
encoder_0_2 = keras.layers.Dense(128)(encoder_0_1d)
encoder_0_2d = keras.layers.Dropout(0.05)(encoder_0_2)
encoder_0_3 = keras.layers.Dense(64)(encoder_0_2d)

# Bidirectional LSTM encoder; the forward/backward states are concatenated
# so they match the decoder cell size (64 * 2).
encoder_1 = keras.layers.LSTM(64, return_state=True, return_sequences=True, recurrent_dropout=0.2)
encoder_lstm_bidirectional = keras.layers.Bidirectional(encoder_1)
encoder_output, state_h1, state_c1, state_h2, state_c2 = encoder_lstm_bidirectional(encoder_0_3)
encoder_state = [Concatenate()([state_h1, state_h2]), Concatenate()([state_c1, state_c2])]

# Decoder: tfa BasicDecoder with a TrainingSampler (teacher forcing during training).
sampler = tfa.seq2seq.sampler.TrainingSampler()
decoder_cell = keras.layers.LSTMCell(64 * 2)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler, output_layer=output_layer)
# sequence_length is defined elsewhere in the notebook.
final_outputs, final_state, final_sequence_lengths = decoder(decoder_embeddings, initial_state=encoder_state,
                                                             sequence_length=[sequence_length], training=True)

y_proba = tf.nn.softmax(final_outputs.rnn_output)
model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[y_proba])
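For completeness, this is roughly how I compile and train it (X_enc and Y_pad are the padded sequences from the sketch above, X_dec is whatever I feed as teacher-forcing decoder input, and the epoch count is just a placeholder):

# Rough training sketch; the model already outputs softmax probabilities,
# so the loss is used with from_logits=False (the default).
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

# batch_size must be 32 because the Input layers fix the batch size.
history = model.fit([X_enc, X_dec], Y_pad, batch_size=32, epochs=20)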
If needed, I can upload the whole notebook to GitHub, but maybe there is a simple solution that I just have not seen so far. Thanks for your help!