
In the TensorFlow GitHub repository, the file attention_wrapper.py defines a hardmax operator. The docs list it as tf.contrib.seq2seq.hardmax.

I want to know the theoretical motivation for providing a hardmax operator. Several weeks of Google searches haven't given me a concrete understanding of the concept.

  1. If softmax is differentiable (soft), why would hardmax ever be used? If it can't be used in backpropagation (gradient calculation requires differentiability), where else can it be used?

  2. The reinforcement learning literature talks about soft vs. hard attention. However, I couldn't find concrete examples or explanations of where tf.contrib.seq2seq.hardmax is actually used in an RL model.

  3. Since it lives in seq2seq, it presumably has some application in natural language processing. But where exactly? There are tons of NLP tasks, and I couldn't find a state-of-the-art algorithm for any of them that uses hardmax.
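
For reference, here is a minimal sketch of what I understand the two operators to compute, written in plain Python rather than TensorFlow. The function names and the example logits are my own illustration, not the library's code; the assumption is that hardmax returns a one-hot vector at the index of the largest logit. It also shows why the gradient question in point 1 arises: a small nudge to a logit changes the softmax output but leaves the hardmax output untouched.

```python
import math


def softmax(logits):
    """Smooth distribution over the entries (illustrative, not the TF op)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hardmax(logits):
    """One-hot vector with a 1 at the largest logit (assumed behaviour)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == best else 0.0 for i in range(len(logits))]


logits = [2.0, 1.0, 0.1]  # made-up example values
soft = softmax(logits)
hard = hardmax(logits)

# Nudge the first logit slightly and compare the outputs.
eps = 1e-3
nudged = [logits[0] + eps] + logits[1:]
soft_changed = softmax(nudged) != soft  # softmax responds to the nudge
hard_changed = hardmax(nudged) != hard  # hardmax does not, so its gradient is zero here
```

Away from ties the hardmax output is piecewise constant, so its gradient is zero almost everywhere, which is what makes it unusable for ordinary backpropagation.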

Chaitanya Bapat

1 Answer


Hardmax is used when you have no choice but to make a decision non-probabilistically. For example, when you use a model to generate a neural architecture, as in neural module networks, you have to make a discrete choice. To make this trainable (since it is non-differentiable, as you state), you can use REINFORCE (an RL algorithm) to train via policy gradient and estimate this loss contribution via Monte Carlo sampling. Neural module networks are an NLP construct and depend on seq2seq. I'm sure there are many other examples, but this is the one that immediately came to mind.
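
To make the REINFORCE idea concrete, here is a rough sketch on a toy problem, not the neural module network setup itself. Everything in it is an assumption for illustration: the three discrete "choices", their fixed rewards, the learning rate, and the running-average baseline. During training the choice is sampled from a softmax policy and the logits are updated with the score-function (policy gradient) estimate; at the end the decision is made hardmax-style by taking the argmax.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw one discrete choice from a categorical distribution."""
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1


rng = random.Random(0)        # fixed seed so the run is repeatable
rewards = [0.1, 0.9, 0.3]     # hypothetical reward for each discrete choice
logits = [0.0, 0.0, 0.0]      # policy parameters
baseline = 0.0                # running average reward, reduces variance
learning_rate = 0.1

for _ in range(2000):
    probs = softmax(logits)
    action = sample(probs, rng)           # Monte Carlo sample of the choice
    advantage = rewards[action] - baseline
    # REINFORCE: grad of log pi(action) w.r.t. logits is one_hot(action) - probs
    for i in range(len(logits)):
        grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += learning_rate * advantage * grad_log_pi
    baseline += 0.05 * (rewards[action] - baseline)

final_probs = softmax(logits)
# Hard decision at the end: pick the index of the largest logit.
best_choice = max(range(len(logits)), key=lambda i: logits[i])
```

The gradient never passes through the discrete choice itself. It only needs the log-probability of the sampled choice, which is why this works where backpropagating through hardmax would not.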

rvd