I'm training a model with TensorFlow Keras, using NumPy arrays as input:

epochs = 10
batch_size = 128

model.fit(
    x = [train_asset_text_seq, train_bug_text_seq],
    y = y_train.values.reshape(-1,1), 
    epochs = epochs,
    batch_size=batch_size,
    validation_data=([val_asset_text_seq, val_bug_text_seq], y_val.values.reshape(-1,1))
)
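
For context, the *_text_seq inputs are 2-D NumPy arrays of padded integer sequences. A rough sketch of how they are produced (the raw text list names and maxlen=124 are placeholders, not the exact values I use):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# placeholder lists of raw strings; maxlen=124 only mirrors the shape seen in the error below
t = Tokenizer()
t.fit_on_texts(train_asset_texts + train_bug_texts)

train_asset_text_seq = pad_sequences(t.texts_to_sequences(train_asset_texts), maxlen=124)
train_bug_text_seq = pad_sequences(t.texts_to_sequences(train_bug_texts), maxlen=124)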

In order to speed up model building and evaluation, I wanted to make use of the tf.data input format, so I changed it to:

X_train_ds = tf.data.Dataset.from_tensor_slices((train_text_1, train_text_2))
y_train_ds = tf.data.Dataset.from_tensor_slices(y_train.values.reshape(-1,1))

X_val_ds = tf.data.Dataset.from_tensor_slices((val_text_1, val_text_2))
y_val_ds = tf.data.Dataset.from_tensor_slices(y_val.values.reshape(-1,1))


model.fit(
    tf.data.Dataset.zip((X_train_ds, y_train_ds)).batch(batch_size).repeat(),
    validation_data=tf.data.Dataset.zip((X_val_ds, y_val_ds)),
    epochs = epochs,
    steps_per_epoch=30
)

which seems to work for training but throws an error for validation with:

Input 0 of layer "lstm" is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (124, 124)

Call arguments received by layer "model" (type Functional):
  • inputs=('tf.Tensor(shape=(124,), dtype=int32)', 'tf.Tensor(shape=(124,), dtype=int32)')
  • training=False
  • mask=None

As you can see, I'm using an LSTM layer in the model. I also tried changing the fit call to batch and repeat the validation dataset as well, but that throws the same error as above:

model.fit(
    tf.data.Dataset.zip((X_train_ds, y_train_ds)).batch(batch_size).repeat(),
    validation_data=tf.data.Dataset.zip((X_val_ds, y_val_ds)).batch(batch_size).repeat(),
    epochs = epochs,
    steps_per_epoch=30,
    validation_steps=30
)

Do I need to adjust the model when I want to use a tf.data.Dataset instead of the NumPy input, and why does it work for training but fail for validation?
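
For debugging, the element structure of the two pipelines can be compared with element_spec. At least in the first attempt the training dataset is batched while the validation dataset is not, so each validation element has no batch dimension (a sketch; the sequence length 124 comes from the error message):

import tensorflow as tf

train_ds = tf.data.Dataset.zip((X_train_ds, y_train_ds)).batch(batch_size)
val_ds = tf.data.Dataset.zip((X_val_ds, y_val_ds))  # as in the first attempt: no .batch()

# batched elements: ((None, 124), (None, 124)) inputs and (None, 1) labels
print(train_ds.element_spec)
# unbatched elements: ((124,), (124,)) inputs and (1,) labels
print(val_ds.element_spec)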

Update:

I'm building a siamese network for text classification. The model is currently defined with:

from tensorflow.keras.layers import (Input, Embedding, LSTM, Dropout, Flatten,
                                     Lambda, Concatenate, Dense)
from tensorflow.keras.models import Model

input_1 = Input(shape=(train_asset_text_seq.shape[1],))
input_2 = Input(shape=(train_bug_text_seq.shape[1],))

# shared embedding layer (t is the fitted tokenizer, EMBEDDING_DIM is defined elsewhere)
common_embed = Embedding(
    name="synopsis_embedd",
    input_dim=len(t.word_index) + 1,
    output_dim=EMBEDDING_DIM,
    input_length=train_asset_text_seq.shape[1],
    mask_zero=True
)

lstm_1 = common_embed(input_1)
lstm_2 = common_embed(input_2)

common_lstm = LSTM(32, return_sequences=True, activation="relu")
vector_1 = common_lstm(lstm_1)
vector_1 = Dropout(0.5)(vector_1)
vector_1 = Flatten()(vector_1)

vector_2 = common_lstm(lstm_2)
vector_2 = Dropout(0.5)(vector_2)
vector_2 = Flatten()(vector_2)

x5 = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([vector_1, vector_2])
    
conc = Concatenate(axis=-1)([x5, vector_1, vector_2])

x = Dense(100, activation="relu", name='conc_layer')(conc)
x = Dropout(0.1)(x)
out = Dense(1, activation="sigmoid", name = 'out')(x)

model = Model([input_1, input_2], out)
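
For completeness, cosine_distance and cos_dist_output_shape are small helper functions that are not shown above; a common definition of this pattern (assumed here, the original may differ slightly) is:

import tensorflow.keras.backend as K

def cosine_distance(vects):
    # 1 - cosine similarity of the two L2-normalised vectors
    x, y = vects
    x = K.l2_normalize(x, axis=-1)
    y = K.l2_normalize(y, axis=-1)
    return 1 - K.sum(x * y, axis=-1, keepdims=True)

def cos_dist_output_shape(shapes):
    # the Lambda layer outputs one distance value per sample
    shape1, shape2 = shapes
    return (shape1[0], 1)
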
  • `repeat()` is for when you need to make sure you don't run out of input data. This isn't your issue. Can you update your question with how you defined your model? Most likely your input layer is defined incorrectly. Switching from arrays to `Dataset` won't improve performance; they are utilized the same way. A `Dataset` contains the input and label arrays within it (and batch size, etc.). It's just a complete package. – Djinn Jul 12 '22 at 18:05
  • Please give more details what are the shapes of `train_asset_text_seq`, `train_bug_text_seq`, `y_train.values`, `val_asset_text_seq`, `val_bug_text_seq` and what is your `model`? – thushv89 Jul 12 '22 at 21:14
  • @thushv89 All text sequences are the result of tokenizing and padding. I use `keras.preprocessing.text.Tokenizer`, fit it on the training corpus, then use `texts_to_sequences` for all strings, and then apply `keras.utils.data_utils.pad_sequences` to all of them. So they are 2-D NumPy arrays. `y_train.values` is an ndarray of floats (1.0 or 0.0), since I built a siamese network for binary classification. – fsulser Jul 13 '22 at 06:50
  • @Djinn Thanks for the performance feedback, I was expecting that. Nevertheless I would be glad to understand the issue better for the future. I added the model definition to the description. – fsulser Jul 13 '22 at 06:53
  • LSTM requires an input shape of two dimensions `(t, n)`, where `t` is the time sequence and `n` is the number of features. You're passing an input shape of one dimension. – Djinn Jul 13 '22 at 07:50
  • Thanks for the explanation! Is that only needed in the case of tf.data and not when using an ndarray as input? What should the time sequence be? – fsulser Jul 13 '22 at 08:07
  • That's always the case. You have to have the time sequence. You should have it yourself; it could be any type of time sequence. – Djinn Jul 13 '22 at 15:08
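
To illustrate the shape requirement from the comments above: with a batched input, the Embedding layer already produces the 3-D (batch, time, features) tensor the LSTM expects. A minimal sketch (the vocabulary size and embedding dimension are assumed values):

from tensorflow.keras.layers import Input, Embedding, LSTM

inp = Input(shape=(124,))                                              # (batch, 124)
emb = Embedding(input_dim=1000, output_dim=100, mask_zero=True)(inp)  # (batch, 124, 100)
seq = LSTM(32, return_sequences=True)(emb)                             # (batch, 124, 32)
print(seq.shape)  # (None, 124, 32)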

0 Answers