The softmax function produces the attention weights, which are then multiplied (MatMul) with V. Are these weights stored anywhere? If they are not stored or reused in the next training iteration, how does learning happen? Moreover, the linear transformation does not appear to use these weights!
Source code: https://github.com/fawazsammani/chatbot-transformer/blob/master/models.py
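
To make the question concrete, here is a minimal sketch of the computation being asked about, assuming single-head scaled dot-product self-attention. It is written in plain Python rather than PyTorch, and the names `w_q`, `w_k`, `w_v` and the toy inputs are hypothetical illustrations, not taken from the linked repository. It shows that the softmax output is recomputed from the current input on every forward pass, while the projection matrices are the values that persist between passes.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(x, w_q, w_k, w_v):
    """Return (output, attention weights) for input x."""
    # Q, K, V come from the linear projections; these matrices are the
    # parameters that an optimizer would update during training.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    # The softmax output is an intermediate activation: it depends on x
    # and is recomputed on every call rather than kept as a parameter.
    attn = softmax_rows(scores)
    return matmul(attn, v), attn

# Hypothetical fixed parameters (assumption: 2-dimensional toy model).
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[0.5, 0.0], [0.0, 0.5]]

# Two different inputs passed through the same parameters.
x1 = [[1.0, 0.0], [0.0, 1.0]]
x2 = [[2.0, 0.0], [0.0, 3.0]]
out1, attn1 = attention(x1, w_q, w_k, w_v)
out2, attn2 = attention(x2, w_q, w_k, w_v)
```

In this sketch `attn1` and `attn2` differ even though `w_q`, `w_k`, and `w_v` are identical in both calls, which is the distinction the question turns on: the attention weights depend on the input, and only the projection matrices carry over from one pass to the next.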