
In this post (https://stackoverflow.com/a/64526124/15693663) I found a very simple and short way to invert an embedding layer. I used this inverse embedding layer, but it does not update the weights in my network. The proposed inverse embedding lookup, copied from that post, is shown below:

import torch

embeddings = torch.nn.Embedding(1000, 100)
my_sample = torch.randn(1, 100)
distance = torch.norm(embeddings.weight.data - my_sample, dim=1)
nearest = torch.argmin(distance)
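
As a quick sanity check (my own addition, not from the linked answer), taking my_sample straight from the embedding table recovers its row index:

check = embeddings.weight.data[42].unsqueeze(0)           # row 42 of the table
check_dist = torch.norm(embeddings.weight.data - check, dim=1)
print(torch.argmin(check_dist).item())                    # prints 42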

What I did: I used an embedding layer that takes one input and produces a 16-D output. Then I added two hidden dense layers (64 -> 16) and one inverse embedding layer. In short:

X -> embedding -> dense layer (64D) -> dense layer (16D) -> inverse embedding -> X'

X and X' are integers.
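
For clarity, here is a minimal sketch of the layout I mean (class and layer names are illustrative placeholders, not my real code):

import torch.nn as nn

class ToyAutoEncoder(nn.Module):                        # placeholder name
    def __init__(self, num_ids, emb_dim=16):
        super().__init__()
        self.emb_in = nn.Embedding(num_ids, emb_dim)    # X -> 16D
        self.dense1 = nn.Linear(emb_dim, 64)            # 16D -> 64D
        self.dense2 = nn.Linear(64, emb_dim)            # 64D -> 16D
        self.emb_out = nn.Embedding(num_ids, emb_dim)   # table used for the inverse lookup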

To compute the loss, I used torch.norm(X - X'), but the weights never update. I cannot figure out why there is no weight update.
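
The symptom can be reproduced in isolation. Below is a minimal, self-contained sketch (standalone tensors, not my actual model) showing that the index returned by argmin carries no gradient information:

import torch

emb = torch.nn.Embedding(1000, 16)
query = torch.randn(4, 16, requires_grad=True)   # stands in for the dense-layer output

dist = torch.cdist(query, emb.weight)            # (4, 1000) pairwise distances
idx = torch.argmin(dist, dim=1)                  # nearest embedding row per query

print(idx.dtype)          # torch.int64 -- integer indices
print(idx.requires_grad)  # False: argmin is not differentiable, the graph stops here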

A short implementation is shown below:

# lS_o = Offset, lS_i = input number
optimizer = opts['sgd'](parameters, lr=args.learning_rate)
#--------------------------------------

def forward(self, lS_o, lS_i):
    out_emb1 = self.embl_inp(lS_o, lS_i)  # 16D == embedding layer
    out_dl1 = self.DLyr1(out_emb1)        # 64D == Dense Layer 1
    out_dl2 = self.DLyr2(out_dl1)         # 16D == Dense Layer 2
    ly = out_dl2
    out = []
    for i in range(ly.shape[0]):
        # subtract row i of ly from every row of the output embedding's weight matrix
        distance = torch.norm(self.emb_out.weight.data - ly[i, None], dim=1)
        # index of the nearest embedding row (this is the non-differentiable step)
        out.append(torch.argmin(distance, dim=0))
    return torch.stack(out)


train_ds = Dataset(...
train_ld = DataLoader(train_ds, ...

pbar = tq.tqdm(enumerate(train_ld), total=len(train_ld))
for j, inputBatch in pbar:
    lS_o, lS_i = unpack_batch(inputBatch)
    ae_out = model(lS_o, lS_i)
    loss = torch.norm(ae_out - lS_i)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
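
To see why nothing updates, a minimal check I can drop into the loop right after computing the loss (diagnostic only):

print(loss.requires_grad, loss.grad_fn)   # False / None would mean the loss is cut off from the weights
for name, p in model.named_parameters():
    print(name, p.grad)                   # gradients that stay None explain why step() changes nothing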

I have studied gumbel_softmax because argmin and argmax are not differentiable, so I changed the code in the forward() function as follows:

distance = torch.norm(distance, dim=1)
out = 1 - torch.nn.functional.gumbel_softmax(distance)

Because I want to find the minimum, I subtract the output of gumbel_softmax from 1.
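
For reference, the variant I am experimenting with, as a standalone sketch with toy tensors (not my full code): feed the negative distances to gumbel_softmax as logits and use the hard one-hot output to select an embedding row, so gradients can flow through the selection.

import torch
import torch.nn.functional as F

emb_out = torch.nn.Embedding(1000, 16)
ly = torch.randn(4, 16, requires_grad=True)             # stands in for out_dl2

dist = torch.cdist(ly, emb_out.weight)                  # (4, 1000) distances
one_hot = F.gumbel_softmax(-dist, tau=1.0, hard=True)   # smaller distance -> larger logit
recon = one_hot @ emb_out.weight                        # (4, 16) selected embedding rows

recon.sum().backward()                                  # toy backward pass
print(ly.grad is not None, emb_out.weight.grad is not None)  # True True: gradients reach both

I am still not sure whether comparing the reconstructed vectors instead of the raw indices is the right loss, though.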

  • Are you trying to use an autoencoder scheme? Your loss being torch.norm is weird, why aren't you using L1 or MSE loss instead? – Farhood ET Aug 02 '21 at 06:05
  • Could you show a little bit more of your code (optimizer, training loop, update logic...)? – Ivan Aug 02 '21 at 11:55
  • @Ivan. I added part of my code to the post (post is updated) – Mor Mory Aug 02 '21 at 14:12
  • @Ivan. If you need more code, let me know and I will add more of it to the post. It is not complicated. Everything is normal: I get data from the dataloader and feed it to the autoencoder, then compare the output and input of the autoencoder to calculate the loss. – Mor Mory Aug 02 '21 at 14:20
  • Yes, please provide some more code. I don't see any issues with what you have provided. It might be that the gradient is not back-propagating properly... A minimal reproducible example would be ideal. – Ivan Aug 02 '21 at 15:05
  • @Ivan. Thanks for spending time on the problem. I updated the post with more code. In fact, the code is more complex and I tried to write the short version here. I added the main loop and forward function for the network. Also, I tried to describe each line in the forward function to make it clear what is going on in the line. – Mor Mory Aug 02 '21 at 16:58
  • @Ivan. The connection between Dense 16 (the second dense layer) and X' is the torch.argmin function. I wonder whether this method updates the weight matrix in the last embedding layer. I am trying to understand how gradient descent back-propagates to update weights, starting from the embedding layer at the end of the autoencoder. In other words, the output of the second dense layer (16D) is not the input of the embedding layer after it. There are no operations like addition, multiplication and so on between these two last layers; only argmin() connects them. – Mor Mory Aug 02 '21 at 17:14
  • Ok, I see your concern. The method provided in the linked question will not learn a second *inverse* embedding, it will just choose an entry in the embedding layer based on the Euclidean distance between the vector you're looking to 'de-embed' and all other embeddings. I think what you're looking to do is to essentially classify the output vectors `out_dl2`, i.e. you could use a linear layer if the dimension count is not too high. – Ivan Aug 03 '21 at 09:49
  • @Ivan. Thanks for your reply. The dimension count is high (2 million). That's why I am trying to use an inverse embedding. – Mor Mory Aug 03 '21 at 10:30
  • Take a look at [this thread](https://stackoverflow.com/questions/54969646/how-does-pytorch-backprop-through-argmax), I'm afraid you won't be able to update your layer with `argmin`... Let me search for an alternative though. – Ivan Aug 05 '21 at 05:03
  • @Ivan. Thanks for your help. I am also thinking of an alternative. I used gumbel_softmax but it did not help. I have changed the code I added to the post (the post is updated again); the new info is at the end of the post. – Mor Mory Aug 05 '21 at 17:56
