In the TensorFlow GitHub repository, a hardmax operator is defined in the file attention_wrapper.py, and the docs list it as tf.contrib.seq2seq.hardmax.
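For reference, here is a minimal sketch (in plain NumPy, not the actual TensorFlow implementation) of what I understand hardmax to compute, in contrast with softmax:

```python
import numpy as np

def softmax(logits):
    # Smooth, differentiable normalization (subtract the max for stability).
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def hardmax(logits):
    # One-hot vector at the argmax position along the last axis.
    return np.eye(logits.shape[-1])[np.argmax(logits, axis=-1)]

scores = np.array([2.0, 1.0, 0.5])
print(softmax(scores))  # ~[0.63 0.23 0.14] -- weight spread over all positions
print(hardmax(scores))  # [1. 0. 0.]        -- all weight on a single position
```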
I want to know the theoretical underpinning behind providing this hardmax operator. Prima facie Google searches over the past few weeks haven't led me to a concrete understanding of the concept.
If softmax is differentiable (soft), why would hardmax ever be used? If it can't be used in backpropagation (because gradient calculation requires differentiability), where else can it be used?
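To make the non-differentiability concern concrete, here is a small check (again just a NumPy sketch of my understanding): nudging the logits leaves the hardmax output unchanged, so the gradient is zero almost everywhere.

```python
import numpy as np

def hardmax(logits):
    # One-hot vector at the argmax position along the last axis.
    return np.eye(logits.shape[-1])[np.argmax(logits, axis=-1)]

logits = np.array([2.0, 1.0, 0.5])
eps = 1e-3
# The output is piecewise constant: a small perturbation does not change it,
# so the finite difference (and hence the gradient) is zero.
print(hardmax(logits))        # [1. 0. 0.]
print(hardmax(logits + eps))  # [1. 0. 0.] -- unchanged
```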
The reinforcement learning literature talks about soft vs. hard attention. However, I couldn't find concrete examples or explanations of where tf.contrib.seq2seq.hardmax can actually be used in an RL model.
By the looks of it, since it lives in seq2seq, it should have some application in Natural Language Processing. But exactly where? There are tonnes of NLP tasks, and I couldn't find a SOTA algorithm for any of them that directly uses hardmax.