I need to get the embeddings from a pre-trained LLM. As of now I am doing something like this:
    import torch

    def gen_embeddings(self, code):
        # code is expected to be a list of strings; tokenize as one padded batch
        tokenized_input_pos = self.tokenizer(code, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            output = self.model(**tokenized_input_pos)
        # Mean-pool over the sequence dimension (note: this includes padding positions)
        embedding = output.last_hidden_state.mean(dim=1).squeeze().tolist()
        # squeeze() drops the batch dimension for a single input, so re-wrap it
        if len(code) == 1:
            return [embedding]
        return embedding
As you can see, I am taking the mean of the last hidden state (the activations, pooled over the sequence). But this approach is taking a lot of time. Instead of taking the mean from the last hidden state, is it possible to get it from the first 4 layers? I know it might affect my model's performance, but for now I am doing a POC kind of thing, so speed is of the essence.
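For clarity, here is roughly what I have in mind, assuming a Hugging Face transformers model that accepts output_hidden_states=True (in that API, hidden_states[0] is the embedding-layer output, so the first 4 transformer layers are indices 1 through 4):

    import torch

    def gen_embeddings_first4(self, code):
        # Same tokenizer/model attributes as in my current method above
        tokenized = self.tokenizer(code, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            # Ask the model to return the hidden states of every layer
            output = self.model(**tokenized, output_hidden_states=True)
        # Average the first 4 transformer layers, then mean-pool over tokens
        first4 = torch.stack(output.hidden_states[1:5]).mean(dim=0)
        embedding = first4.mean(dim=1).squeeze().tolist()
        if len(code) == 1:
            return [embedding]
        return embedding

As far as I understand, this on its own would not be any faster, since the full forward pass through all layers still runs; to actually save compute I would presumably also have to truncate the layer stack, e.g. self.model.encoder.layer = self.model.encoder.layer[:4] for a BERT-style encoder (the attribute path differs per architecture). Is that the right way to go about it?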