
I have a GCMLE experiment with three learning objectives (call them Task A, Task B, and Task C) within a single model_fn(). The inputs for all three objectives are the same (a body of text) and I would like to produce three separate predictions. However, for Task C I would like to mask some of the examples in each batch (~20% per batch). Is the proper way to do this simply to weight the samples I want to mask by zero? Consider this loss function:

lossA = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(
            labels=labelsA, logits=logitsA))

lossB = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(
            labels=labelsB, logits=logitsB))

mask_weights = tf.to_float(tf.equal(x, y))  # 1.0 where x == y, 0.0 where x != y
lossC = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(
            labels=labelsC, logits=logitsC, weights=mask_weights))

loss = lossA + lossB + lossC

Essentially, what I am trying to do is mask any samples in the batch where x != y so that those examples produce no gradient updates to the model for Task C. Is this anywhere near the desired effect? Is there a better way to implement this behavior?

I realize that I could split these up into separate experiments, but I would like to be able to have a shared embedding and also a single graph which I can upload into the GCMLE prediction service.
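For reference, the weighting semantics I am relying on can be sketched in plain NumPy (the batch values below are hypothetical; tf.losses.sparse_softmax_cross_entropy's default reduction, SUM_BY_NONZERO_WEIGHTS, divides the weighted sum by the number of nonzero weights):

```python
import numpy as np

# Hypothetical stand-ins for the graph tensors: a batch of 4 examples,
# 3 classes, with the third example masked because x != y there.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.1, 1.0, 0.3],
                   [1.5, 1.5, 0.2],
                   [0.2, 2.2, 0.4]])
labels = np.array([0, 1, 0, 1])
x = np.array([1, 1, 0, 1])
y = np.array([1, 1, 1, 1])
weights = (x == y).astype(np.float64)  # [1.0, 1.0, 0.0, 1.0]

def per_example_xent(logits, labels):
    # per-example sparse softmax cross-entropy: -log softmax(logits)[label]
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def weighted_loss(logits, labels, weights):
    # mirrors the SUM_BY_NONZERO_WEIGHTS reduction: weighted sum of
    # per-example losses divided by the count of nonzero weights
    total = (per_example_xent(logits, labels) * weights).sum()
    return total / max(np.count_nonzero(weights), 1)

masked = weighted_loss(logits, labels, weights)
# the masked loss equals the plain mean over only the surviving examples
assert np.isclose(masked, per_example_xent(logits, labels)[weights > 0].mean())
```

So zero-weighted examples drop out of both the numerator and the denominator; they neither contribute loss nor dilute the average over the unmasked examples.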

reese0106
  • I don't think combining the three tasks is a good idea. I think you are assuming that all the tasks will be simultaneously optimized. But loss C functions more like a regularization term on loss A. So you will get the best loss A subject to having a reasonable loss C. – Lak Oct 28 '17 at 04:43
  • @LakLakshmanan thanks for the response! I actually think that this regularization is the intended behavior that I am looking to experiment with. I was inspired by this blogpost: http://ruder.io/multi-task-learning-nlp/ so I would like to introduce auxiliary tasks and jointly train them. Given this context, is there still a different way that you would recommend implementing this in TF? I know that Keras has multi-task learning built in: https://keras.io/getting-started/functional-api-guide/#multi-input-and-multi-output-models and I have seen good results I would like to replicate with TF. – reese0106 Oct 28 '17 at 13:27
  • @LakLakshmanan irrespective of the way that I combine the loss, would passing a weight of zero for lossC effectively mask those samples from resulting in any gradient updates or would you suggest a separate way to mask samples within a custom loss function? – reese0106 Oct 28 '17 at 13:31
  • Interesting overview of multitask learning! Thanks for sharing. The problem you have is that, in your case, the effective batch size for loss C is now lower. Gradient updates always apply to all weights, but they will be heavily focused on loss A and loss B. – Lak Oct 28 '17 at 15:23
  • That makes sense! So while I recognize that there are likely issues with reducing the effective batch size of C (I will consider alternatives that do not do that based on your advice) would your response also mean that multiplying by a weight of zero would be the best implementation if I were to try this approach and see what happens? – reese0106 Oct 28 '17 at 16:40
  • 1
    Yes, applying a binary mask is the best approach. That is how tf.layers.dropout is implemented, for example. – Lak Oct 28 '17 at 17:06

1 Answer


To summarize the comments: applying a binary mask to the loss function, as described in the post, is the appropriate way to mask examples out of the loss (this is how tf.layers.dropout is implemented, for example). However, it reduces the effective batch size for Task C, so gradient updates will be dominated by Tasks A and B, which may discourage this approach.
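As a sanity check on the gradient claim, here is a small NumPy sketch (hypothetical values). The gradient of softmax cross-entropy with respect to the logits is (softmax(logits) - one_hot(label)), so scaling an example's loss by a weight of zero zeroes out its entire gradient row:

```python
import numpy as np

logits = np.array([[2.0, 0.5], [0.1, 1.0], [1.5, 1.5]])
labels = np.array([0, 1, 0])
weights = np.array([1.0, 1.0, 0.0])   # third example masked

probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
one_hot = np.eye(logits.shape[1])[labels]

# gradient of the weighted, SUM_BY_NONZERO_WEIGHTS-reduced loss w.r.t. logits
grad = weights[:, None] * (probs - one_hot) / np.count_nonzero(weights)

assert np.allclose(grad[2], 0.0)      # masked example: no gradient update
assert not np.allclose(grad[0], 0.0)  # unmasked examples still train
```

The masked example contributes exactly zero to the Task C gradient, while the shared layers still receive its gradients from Tasks A and B.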

reese0106