
I'm using Keras with the TensorFlow backend. My goal is to query the batch size of the current batch inside a custom loss function. This is needed to compute values of the custom loss function that depend on the index of particular observations. I'd like to make this clearer with the minimal reproducible examples below.

(BTW: Of course I could use the batch size defined for the training procedure and plug in its value when defining the custom loss function, but there are reasons why it can vary, especially if epochsize % batchsize (epoch size modulo batch size) is not zero: then the last batch of an epoch has a different size. I didn't find a suitable approach on Stack Overflow, e.g. Tensor indexing in custom loss function, Tensorflow custom loss function in Keras - loop over tensor, and Looping over a tensor, because obviously the shape of a tensor can't be inferred while the graph is being built, which is exactly when the loss function is defined; shape inference is only possible when evaluating on actual data, which in turn requires the graph. Hence I need to tell the custom loss function to do something with particular elements along a certain dimension without knowing the length of that dimension.)
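
To make the last-batch point concrete, here is a small sketch (mine, not part of the original examples) using the same numbers as the dummy data below:

# Illustration only: with 1000 samples and batch_size=32 (as in the examples below),
# each epoch has 31 full batches of 32 plus a final, smaller batch of 8 samples.
n_samples, batch_size = 1000, 32
full_batches, last_batch_size = divmod(n_samples, batch_size)
print(full_batches, last_batch_size)  # -> 31 8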

(this setup is the same in all examples)

from keras.models import Sequential
from keras.layers import Dense, Activation

# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))

Example 1: nothing special, no issue, no custom loss

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])    

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)

(Output omitted, this runs perfectly fine)

Example 2: nothing special, with a fairly simple custom loss

def custom_loss(yTrue, yPred):
    loss = np.abs(yTrue-yPred)
    return loss

model.compile(optimizer='rmsprop',
              loss=custom_loss,
              metrics=['accuracy'])

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)

(Output omitted, this runs perfectly fine)

Example 3: the issue

def custom_loss(yTrue, yPred):
    print(yPred) # Output: Tensor("dense_2/Sigmoid:0", shape=(?, 1), dtype=float32)
    n = yPred.shape[0]
    for i in range(n): # TypeError: __index__ returned non-int (type NoneType)
        loss = np.abs(yTrue[i]-yPred[int(i/2)])
    return loss

model.compile(optimizer='rmsprop',
              loss=custom_loss,
              metrics=['accuracy'])

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)

Of course the tensor has no shape info at this point; it can't be inferred when building the graph, only at training time. Hence `for i in range(n)` raises an error. Is there any way to do this?
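
For reference, a minimal sketch of the difference between the static and the dynamic batch dimension (variable names are mine): the static shape is None while the graph is built, and the dynamic size only exists as a tensor, so it can be fed to backend ops but not used as the bound of a Python loop.

from keras import backend as K

def custom_loss(yTrue, yPred):
    n_static = K.int_shape(yPred)[0]  # None at graph-build time
    n_dynamic = K.shape(yPred)[0]     # scalar int32 tensor, evaluated per batch
    # n_dynamic can be passed to backend ops (gather, top_k, slice, ...),
    # but not to `for i in range(...)`, which needs a plain Python int.
    return K.abs(yTrue - yPred)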

The traceback of the output: (screenshot omitted; the relevant error is the TypeError shown in the code comment above)

-------

BTW, here's my actual custom loss function, in case there are any questions. I skipped it above for clarity and simplicity.

from keras import backend as K

def neg_log_likelihood(yTrue, yPred):
    yStatus = yTrue[:,0]
    yTime = yTrue[:,1]
    n = yTrue.shape[0]
    for i in range(n):  # fails for the same reason as example 3: n is unknown while building the graph
        s1 = K.greater_equal(yTime, yTime[i])
        s2 = K.exp(yPred[s1])
        s3 = K.sum(s2)
        logsum = K.log(s3)
        loss = K.sum(yStatus[i] * yPred[i] - logsum)
    return loss

Here's an image of the partial negative log-likelihood of the Cox proportional hazards model.

(image of the formula omitted)
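
Since the image isn't reproduced here: the standard textbook form of the Cox partial negative log-likelihood, which the code above is meant to implement, is

$$ -\ell(\beta) = -\sum_{i=1}^{n} \delta_i \left( x_i^\top \beta \;-\; \log \sum_{j:\, t_j \ge t_i} \exp\!\big(x_j^\top \beta\big) \right) $$

where $\delta_i$ is the event status (yStatus), $t_i$ the observed time (yTime), and $x_i^\top \beta$ the model output (yPred).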

This is to clarify a question raised in the comments and to avoid confusion. I don't think it is necessary to understand this in detail to answer the question.

  • The answer is: don't iterate. I'd help, but there are so many strange things in your loss function that I can't understand it. But you know that `yTrue` and `yPred` have **always** the same shape, right? And that, by your examples, there isn't a `yTrue[:,1]`. – Daniel Möller Jul 18 '18 at 16:05
  • I know that this iteration doesn't work, and it is clear why it doesn't work: the shape can't be inferred at the time the graph is built. That's why I wrote the issue. The question is, how can this be done instead? Regarding your question, I have to contradict you: the shapes can be different, depending on the loss function. The loss function is the partial log likelihood of the Cox model, meaning yTrue is (status, time). status is 0 for censored, 1 for event. time is the observation time for the observed status. yPred is the parameter of the Cox model which minimizes the loss function. – Thomas Jul 19 '18 at 06:48
  • I've added a picture of the loss function. Even though it is not really necessary to overcome the issue (example 3), I want to avoid confusion. – Thomas Jul 19 '18 at 06:49
  • As someone that's been using keras for years, I repeat: `y_true` and `y_pred`, both have exactly the same shape, always. This is the shape of what you passed as `y_train` divided in batches. It's simply impossible to have them with different shapes. – Daniel Möller Jul 19 '18 at 12:57
  • Just to see if I understand the picture of the loss, what you expect that your model predict is `Bx` (as if they were a single var)? Do you have expected known values for `Bx`? Where is `delta` coming from, is it always given or you also want the model to predict `delta`? Is your output shape really `(batch,1)` or was that just a test? Should the samples in the batch be ordered as if they were a timeline? Is there only one timeline in your entire data? – Daniel Möller Jul 19 '18 at 12:57
  • These questions are important to see whether you should have it as a "loss function" (as defined by keras) or you should incorporate this into the model using a dummy loss function with dummy `y_trues`. To see if it's possible to have more than one batch, and if the samples must be protected against shuffling.... – Daniel Möller Jul 19 '18 at 12:59
  • Dear Daniel, thanks for your time and effort. I found two solutions: (1) a non-efficient one using looping, (2) one using the TensorFlow backend and its vectorizations. After I've done a few refinements I'll post it here and answer all your questions. – Thomas Jul 20 '18 at 11:10
  • @Thomas, how did you solve the problem? Can you post it here? I really would like to know. – Michelle Owen Apr 16 '19 at 13:09
  • @DanielMöller, in fact, y_pred and y_true can have different shapes... – Michelle Owen Apr 16 '19 at 13:16

1 Answer


As usual, don't loop. Looping brings severe performance drawbacks and also bugs. Use only backend functions unless it's totally unavoidable (it usually isn't).


Solution for example 3:

So, there is a very weird thing there...

Do you really want to simply ignore half of your model's predictions? (Example 3)

Assuming this is true, just duplicate your tensor in the last dimension, flatten it, and discard half of it. You get exactly the effect you want.

from keras import backend as K

def custom_loss(true, pred):
    n = K.shape(pred)[0:1]

    pred = K.concatenate([pred]*2, axis=-1) #duplicate in the last axis
    pred = K.flatten(pred)                  #flatten 
    pred = K.slice(pred,                    #take only half (= n samples)
                   K.constant([0], dtype="int32"), 
                   n) 

    return K.abs(true - pred)
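
To see what the duplicate → flatten → slice sequence does, here is a small NumPy illustration (mine, not part of the original answer):

import numpy as np

pred = np.array([[0.1], [0.2], [0.3], [0.4]])  # shape (4, 1), like yPred
dup = np.concatenate([pred, pred], axis=-1)    # shape (4, 2)
flat = dup.flatten()                           # [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4]
half = flat[:pred.shape[0]]                    # [0.1, 0.1, 0.2, 0.2]
# element i of `half` equals pred[i // 2], exactly what the loop in example 3 intended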

Solution for your loss function:

If you have sorted times from greater to lower, just do a cumulative sum.
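
A tiny NumPy sketch (my own, with made-up numbers) of why that works: after sorting from greater to lower time, entry i of the cumulative sum of exp(pred) is the sum over all j with t_j >= t_i.

import numpy as np

time = np.array([2., 5., 1., 4.])
pred = np.array([0.1, 0.2, 0.3, 0.4])

order = np.argsort(-time)              # indices sorted from greater to lower time
sorted_pred = pred[order]              # preds for times 5, 4, 2, 1
sums = np.cumsum(np.exp(sorted_pred))  # sums[i] = sum of exp(pred_j) over all t_j >= t_i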

Warning: If you have one time per sample, you cannot train with mini-batches!!!
batch_size = len(labels)

It makes sense to have time in an additional dimension (many times per sample), as is done in recurrent and 1D conv networks. Anyway, considering your example as expressed, that is shape (samples_equal_times,) for yTime:

import tensorflow as tf
from keras import backend as K

def neg_log_likelihood(yTrue,yPred):
    yStatus = yTrue[:,0]
    yTime = yTrue[:,1]    
    n = K.shape(yTrue)[0]    


    #sort the times and everything else from greater to lower:
    #note: you can have the data sorted already and avoid doing it here, for performance

    #important: yTime will be sorted along the last dimension, so make sure it's (None,) in this case,
    #or (None, time_length) in the case of many times per sample
    sortedTime, sortedIndices = tf.math.top_k(yTime, n, True)    
    sortedStatus = K.gather(yStatus, sortedIndices)
    sortedPreds = K.gather(yPred, sortedIndices)

    #do the calculations
    exp = K.exp(sortedPreds)
    sums = K.cumsum(exp)  #this will have the sum for j >= i in the loop
    logsums = K.log(sums)

    return K.sum(sortedStatus * sortedPreds - logsums)
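
A hedged usage sketch (the dummy targets and variable names are mine, mirroring the dummy data in the question): yTrue carries both columns, and, per the warning above, the whole dataset goes into a single batch. Whether Keras accepts a target with a different last dimension than the model output depends on the Keras version (see the comment discussion above).

import numpy as np

# dummy survival targets: status (1 = event, 0 = censored) and observation time
status = np.random.randint(2, size=(1000,)).astype('float32')
time = np.random.random((1000,)).astype('float32')
y = np.stack([status, time], axis=-1)   # shape (1000, 2): columns [status, time]

model.compile(optimizer='rmsprop', loss=neg_log_likelihood)
model.fit(data, y, epochs=10, batch_size=len(y))  # one time per sample -> one full-size batch
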
  • Well, I've got to give you the bounty since you answered the original question, but it didn't apply to my situation unfortunately. You can think of my case as each sample having a variable number of times, which I use to set up a dynamic programming matrix of shape (num_times_true, num_times_pred). This might be best answered in a different question. – kjohnsen Nov 29 '19 at 22:14
  • Usually having a multiplication by a mask (that is a matrix with zeros and ones corresponding to what you want to use/discard) does the job. I don't believe you can have a batch with different sizes anyway. So a formulation like this for each batch might work ok. – Daniel Möller Nov 30 '19 at 18:26
  • One possibility for looping is using `tf.split` for the batch dimension and then looping each resulting tensor, but this is terrible for performance. – Daniel Möller Nov 30 '19 at 18:27
  • I see. Thanks for the ideas! – kjohnsen Dec 04 '19 at 16:39