While experimenting with PyTorch's MultiheadAttention, I noticed that increasing or decreasing the number of attention heads does not change the total number of learnable parameters of my model.
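Roughly what I'm checking (a minimal sketch; the embed_dim of 512 is just an illustrative value, not my actual model size):

```python
import torch.nn as nn

embed_dim = 512  # illustrative embedding size

# Count parameters of MultiheadAttention for different head counts
for num_heads in (1, 2, 4, 8):
    mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads)
    total = sum(p.numel() for p in mha.parameters())
    print(f"num_heads={num_heads}: {total} parameters")
```

Every value of num_heads prints the same parameter count.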
Is this behavior correct? And if so, why?
Shouldn't the number of heads affect the number of parameters the model can learn?