I am training an autoencoder on data where each observation is a vector p = [p_1, p_2, ..., p_n] with 0 < p_i < 1 for all i. Furthermore, each input p can be partitioned into parts such that the elements of each part sum to 1. This is because the elements represent the parameters of categorical distributions, and p contains the parameters of multiple categorical distributions.
As an example, the data that I have comes from a probabilistic database that may look like this:
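Each row might, for instance, contain two attributes with two categories each (so that structure = [2, 2], as in the code below); the following sketch uses purely illustrative numbers:

import numpy as np

# One observation with two categorical attributes of two categories each.
# The probabilities are illustrative only; each attribute's block sums to 1.
p = np.array([0.7, 0.3,    # attribute 1: P(category 1), P(category 2)
              0.1, 0.9])   # attribute 2: P(category 1), P(category 2)
structure = [2, 2]         # number of categories per attribute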
In order to enforce this constraint on my output, I use multiple softmax activations in the functional model API from Keras, one softmax output per categorical attribute. In fact, what I am doing is similar to multi-label classification. The implementation is as follows:
from keras.layers import Input, Dense
from keras.models import Model
from keras import optimizers
import numpy as np

encoding_dim = 3
numb_cat = len(structure)  # number of categorical attributes (softmax groups)

inputs = Input(shape=(train_corr.shape[1],))
encoded = Dense(6, activation='linear')(inputs)
encoded = Dense(encoding_dim, activation='linear')(encoded)
decoded = Dense(6, activation='linear')(encoded)

# one softmax output per attribute, with as many units as that attribute has categories
decodes = [Dense(e, activation='softmax')(decoded) for e in structure]

losses = [jsd for j in range(numb_cat)]  # JSD loss (symmetric KL) for every output
autoencoder = Model(inputs, decodes)
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
autoencoder.compile(optimizer=sgd, loss=losses, loss_weights=[1 for k in range(numb_cat)])
# split the targets column-wise into one block per attribute
structure_0 = [0] + structure  # prepend 0 so the cumulative sums give the slice boundaries
train_attr_corr = [train_corr[:, i:j] for i, j in zip(np.cumsum(structure_0[:-1]), np.cumsum(structure_0[1:]))]
test_attr_corr = [test_corr[:, i:j] for i, j in zip(np.cumsum(structure_0[:-1]), np.cumsum(structure_0[1:]))]

history = autoencoder.fit(train_corr, train_attr_corr, epochs=100, batch_size=2,
                          shuffle=True, verbose=1,
                          validation_data=(test_corr, test_attr_corr))
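As a quick note on the output format: since the model has one softmax head per attribute, predict returns a list of arrays, which can be concatenated column-wise to recover vectors in the same layout as the input:

# stitch the per-attribute softmax outputs back into full probability vectors
reconstructed = np.concatenate(autoencoder.predict(test_corr), axis=1)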
where structure is a list containing the number of categories that each attribute has, thus governing which nodes in the output layer are grouped together and go through the same softmax. In the example above, structure = [2, 2]. Furthermore, the loss jsd is a symmetric version of the KL divergence.
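A symmetric KL-style loss of this kind can be written with the Keras backend; the following is only a sketch of the Jensen-Shannon divergence and may differ from the jsd actually used above:

import keras.backend as K

def jsd(y_true, y_pred):
    # Jensen-Shannon divergence: average KL of y_true and y_pred to their midpoint.
    # Clipping avoids log(0); this is a sketch, not necessarily the exact loss used above.
    y_true = K.clip(y_true, K.epsilon(), 1)
    y_pred = K.clip(y_pred, K.epsilon(), 1)
    m = 0.5 * (y_true + y_pred)
    return 0.5 * (K.sum(y_true * K.log(y_true / m), axis=-1)
                  + K.sum(y_pred * K.log(y_pred / m), axis=-1))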
Question:

When using linear activation functions, the results are pretty good. However, when I try to use non-linear activation functions (relu or sigmoid), the results are much worse. What might be the reason for this?