
I am creating a concatenated model using Keras. For now, I am keeping it simple, using only Dense layers and no hyperparameter optimization. My model should be able to take data from two different datasets, each with a different number of samples.

After creating and compiling the model, when I try to fit the model on the two datasets, I get this error:

ValueError: Data cardinality is ambiguous:
  x sizes: 2093, 807
  y sizes: 2093, 807
Make sure all arrays contain the same number of samples.

2093 and 807 are the numbers of rows in the two datasets.
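To illustrate, here is a toy reproduction with made-up shapes (all names and sizes are hypothetical, not my real data):

import numpy as np
import tensorflow as tf

# Keras consumes the inputs sample-by-sample, so mismatched first
# dimensions are rejected up front with the cardinality error.
a = tf.keras.layers.Input(shape=(3,))
b = tf.keras.layers.Input(shape=(2,))
merged = tf.keras.layers.Concatenate()([a, b])
out = tf.keras.layers.Dense(1)(merged)
toy_model = tf.keras.Model(inputs=[a, b], outputs=out)
toy_model.compile(loss="mse", optimizer="adam")

# 5 samples for the first input, 4 for the second
# -> ValueError: Data cardinality is ambiguous
toy_model.fit([np.zeros((5, 3)), np.zeros((4, 2))], np.zeros((5, 1)))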

I would have expected each base model to learn independently of the other, using only the input data available to it, and the concatenated model to then output a prediction based on the characteristics of each sample in the test set. I know I could pad the two datasets, adding an all-zero row for each sample that has no measurements in a given dataset (see the sketch after this paragraph), but I would prefer to avoid that if possible. Does anyone know a workaround for this kind of problem?
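For concreteness, this is roughly what the padding workaround I want to avoid would look like (a minimal sketch; it assumes the 807 lab samples line up with the first 807 diagnosis samples, which is not guaranteed for real data):

import numpy as np

# Pad the smaller dataset with all-zero rows so both inputs share one
# cardinality. ASSUMPTION: the 807 lab rows correspond to the first 807
# diagnosis rows; with real data the rows would first have to be matched
# by a sample/patient identifier.
n_pad = x_train_diags.shape[0] - x_train_labs.shape[0]  # 2093 - 807
x_train_labs_padded = np.concatenate(
    [x_train_labs, np.zeros((n_pad, x_train_labs.shape[1]))], axis=0
)

# With a single shared sample set there is also a single target array;
# under the alignment assumption, y_train_diags labels all 2093 samples.
full_model.fit(
    [x_train_diags, x_train_labs_padded],
    y_train_diags,
    batch_size=64,
    epochs=5,
)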

I checked similar questions, but they mostly hit this error when the mismatched cardinality is unintended, whereas in my model it is an intended feature.

Thanks in advance

EDIT: here is my code, in case it helps.

import tensorflow as tf

# Sanity-check the shapes of the two training sets
print(x_train_diags.shape)
print(x_train_labs.shape)
print(y_train_diags.shape)
print(y_train_labs.shape)

# Branch 1: diagnoses (1032 features per sample)
input_diags = tf.keras.layers.Input(shape=(1032,))
dense_1_diags = tf.keras.layers.Dense(16, activation=tf.keras.activations.elu)(input_diags)
dense_2_diags = tf.keras.layers.Dense(4, activation=tf.keras.activations.elu)(dense_1_diags)

# Branch 2: lab results (230 features per sample)
input_labs = tf.keras.layers.Input(shape=(230,))
dense_1_labs = tf.keras.layers.Dense(16, activation=tf.keras.activations.elu)(input_labs)
dense_2_labs = tf.keras.layers.Dense(4, activation=tf.keras.activations.elu)(dense_1_labs)

# Merge the two branches
concatenation_layer = tf.keras.layers.Concatenate()([dense_2_diags, dense_2_labs])

# Single binary output; sigmoid (rather than elu) keeps predictions in [0, 1],
# as required by BinaryCrossentropy(from_logits=False)
output = tf.keras.layers.Dense(units=1, activation=tf.keras.activations.sigmoid)(concatenation_layer)

full_model = tf.keras.Model(inputs=[input_diags, input_labs], outputs=[output])

full_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    metrics=[tf.keras.metrics.AUC(), tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

# This call raises the cardinality error: the two x arrays (and the two
# y arrays) have different numbers of samples
full_model.fit([x_train_diags, x_train_labs], [y_train_diags, y_train_labs], batch_size=64, epochs=5)

The results of the print statements are:

(2093, 1032)
(807, 230)
(2093, 1)
(807, 1)

so everything is as expected there.

  • Can you share the code where you build the datasets? – Shubham Panchal Dec 05 '22 at 15:23
  • The code to build the datasets is quite long, since a lot of preprocessing was needed to get each dataset. Moreover, the original dataset is private and can't be shared. I can share the code for the concatenated model, if that helps. – Foxtrot_Romeo Dec 05 '22 at 15:32
  • The model is fine; the problem is that `x_train_diags` has 2093 samples while `x_train_labs` has 807. The sample counts must match, so your shapes should be `(2093, 1032) (2093, 230) (2093, 1) (2093, 1)` or `(807, 1032) (807, 230) (807, 1) (807, 1)` – Mohammad Ahmed Dec 05 '22 at 16:42
  • @Mohammad That is exactly what I am asking: is there a way to avoid this? One of the advantages of such an architecture would be the ability to use different data sources with more flexibility – Foxtrot_Romeo Dec 05 '22 at 19:46

0 Answers