I'm trying to implement a simple image-to-text network. For every image the network outputs 5 characters, and there are 23 possible characters, so each label is a 5x23 array. When I try to fit the model I get the following error:
ValueError: Data cardinality is ambiguous:
(...)
When I pass just one example, the message is:
ValueError: Data cardinality is ambiguous:
x sizes: 1
y sizes: 5
Please provide data which shares the same first dimension.
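One plausible cause of `x sizes: 1` / `y sizes: 5`: a single label of shape (5, 23) passed without a leading batch axis, so Keras reads 5 samples in y against 1 in x. The sketch below assumes a hypothetical 23-character alphabet `ALPHABET` and hypothetical helpers `encode_labels` / `decode_labels`. It one-hot encodes each 5-character string so the labels stack to shape (N, 5, 23), matching an x of shape (N, h, w, 1).

```python
import numpy as np

# Hypothetical 23-symbol alphabet (10 digits + A..M); replace with the real set.
ALPHABET = "0123456789ABCDEFGHIJKLM"
CHAR_TO_INDEX = {c: i for i, c in enumerate(ALPHABET)}


def encode_labels(texts, seq_len=5):
    """One-hot encode strings into an array of shape (N, seq_len, len(ALPHABET))."""
    y = np.zeros((len(texts), seq_len, len(ALPHABET)), dtype=np.float32)
    for n, text in enumerate(texts):
        if len(text) != seq_len:
            raise ValueError(f"expected {seq_len} characters, got {text!r}")
        for t, ch in enumerate(text):
            y[n, t, CHAR_TO_INDEX[ch]] = 1.0
    return y


def decode_labels(y):
    """Map one-hot labels (or per-timestep softmax outputs) back to strings via argmax."""
    return ["".join(ALPHABET[i] for i in row) for row in np.argmax(y, axis=-1)]


# A single label has shape (5, 23): its first axis (5) is read as 5 samples.
single = encode_labels(["A1B2C"])[0]
# Adding a batch axis gives (1, 5, 23), whose first dimension matches x.
single_batched = np.expand_dims(single, axis=0)
# Hypothetical single grayscale image with h=50, w=200.
x_single = np.zeros((1, 50, 200, 1), dtype=np.float32)
```

With x of shape (N, h, w, 1) and y of shape (N, 5, 23), both arrays share the first dimension N, which is what the error message asks for. One-hot targets of this shape pair with a per-timestep softmax output and `categorical_crossentropy`.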
How can I properly train the network for this task?
The model is the following:
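import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Reshape, Dense, GRU, Lambda, add
from tensorflow.keras.models import Model

# h, w, conv_filters, kernel_size, pool_size and n_tokens are defined elsewhere
input_layer = Input(shape=(h, w, 1))
x = Conv2D(conv_filters, kernel_size, activation='relu')(input_layer)
x = MaxPooling2D((pool_size, pool_size))(x)
x = Conv2D(conv_filters, kernel_size, activation='relu')(x)
x = MaxPooling2D((pool_size, pool_size))(x)
x = Reshape(target_shape=(5, -1))(x)  # 5 timesteps, one per output character
x = Dense(5, activation='swish')(x)
fw = GRU(128, return_sequences=True, kernel_initializer='he_normal')(x)
bw = GRU(128, return_sequences=True, go_backwards=True, kernel_initializer='he_normal')(x)
# go_backwards=True returns the sequence in reversed time order; flip it back
# so each timestep of bw lines up with the same timestep of fw before adding
bw = Lambda(lambda t: tf.reverse(t, axis=[1]))(bw)
bgru = add([fw, bw])
output = Dense(n_tokens, activation='softmax')(bgru)  # shape (None, 5, n_tokens)
model = Model(inputs=input_layer, outputs=output)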
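`Reshape(target_shape=(5, -1))` only works if the flattened feature map after the second pooling layer has a size divisible by 5. The sketch below walks through that arithmetic in plain Python. It assumes Keras defaults (Conv2D with `padding='valid'` and stride 1, MaxPooling2D with `padding='valid'` and stride equal to the pool size). The values passed for `h`, `w`, `conv_filters`, `kernel_size` and `pool_size` are hypothetical, since the question does not give them, and `timestep_features` is a hypothetical helper.

```python
def conv_valid(n, k):
    # Conv2D output size with padding='valid' and strides=1 (Keras defaults).
    return n - k + 1


def pool_valid(n, p):
    # MaxPooling2D output size with padding='valid' and strides defaulting to pool_size.
    return (n - p) // p + 1


def timestep_features(h, w, conv_filters, kernel_size, pool_size, timesteps=5):
    """Return (feature map shape, features per timestep) for the Conv-Pool-Conv-Pool
    stack above, or raise if Reshape((timesteps, -1)) could not split it evenly."""
    for _ in range(2):  # two Conv2D + MaxPooling2D blocks
        h, w = conv_valid(h, kernel_size), conv_valid(w, kernel_size)
        h, w = pool_valid(h, pool_size), pool_valid(w, pool_size)
    total = h * w * conv_filters
    if total % timesteps:
        raise ValueError(
            f"feature map {(h, w, conv_filters)} has {total} values, "
            f"not divisible by {timesteps}"
        )
    return (h, w, conv_filters), total // timesteps
```

For example, with the hypothetical values h=50, w=200, kernel_size=3 and pool_size=2, the map after the second pooling layer is 11x48. With 20 filters it splits into 5 timesteps of 2112 features each, while 16 filters (8448 values) cannot be split into 5 equal parts. When the reshape does work, the model's output has shape (None, 5, n_tokens), so the labels passed to `fit` need shape (N, 5, n_tokens).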