In the TensorFlow GitHub repository, a hardmax operator is defined in the file attention_wrapper.py, and the docs list it as tf.contrib.seq2seq.hardmax.
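For reference, here is a minimal sketch (in plain NumPy, not the actual TensorFlow implementation) of what I understand hardmax to compute, in contrast with softmax:

```python
import numpy as np

def softmax(logits):
    # Smooth, differentiable normalization (subtract the max for stability).
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def hardmax(logits):
    # One-hot vector at the argmax position along the last axis.
    return np.eye(logits.shape[-1])[np.argmax(logits, axis=-1)]

scores = np.array([2.0, 1.0, 0.5])
print(softmax(scores))  # ~[0.63 0.23 0.14] -- weight spread over all positions
print(hardmax(scores))  # [1. 0. 0.]        -- all weight on a single position
```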
I want to know the theoretical underpinning behind providing this hardmax operator. Prima facie Google searches over the past few weeks haven't led me to a concrete understanding of the concept.
If softmax is differentiable (soft), why would hardmax ever be used? If it can't be used in backpropagation (because gradient calculation requires differentiability), where else can it be used?
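To make the non-differentiability concern concrete, here is a small check (again just a NumPy sketch of my understanding): nudging the logits leaves the hardmax output unchanged, so the gradient is zero almost everywhere.

```python
import numpy as np

def hardmax(logits):
    # One-hot vector at the argmax position along the last axis.
    return np.eye(logits.shape[-1])[np.argmax(logits, axis=-1)]

logits = np.array([2.0, 1.0, 0.5])
eps = 1e-3
# The output is piecewise constant: a small perturbation does not change it,
# so the finite difference (and hence the gradient) is zero.
print(hardmax(logits))        # [1. 0. 0.]
print(hardmax(logits + eps))  # [1. 0. 0.] -- unchanged
```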
The reinforcement learning literature talks about soft vs. hard attention. However, I couldn't find concrete examples or explanations of where tf.contrib.seq2seq.hardmax can actually be used in an RL model.
By the looks of it, since it lives in seq2seq, it should have some application in Natural Language Processing. But exactly where? There are tonnes of NLP tasks, and I couldn't find a SOTA algorithm for any of them that directly uses hardmax.