
I am replicating code from this page. I have downloaded the BERT model to my local system and am getting sentence embeddings from it.

I have around 500,000 sentences for which I need sentence embeddings, and it is taking a lot of time.

  1. Is there a way to expedite the process?
  2. Would sending batches of sentences rather than one sentence at a time help?


#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

corpa=["i am a boy","i live in a city"]



storage = []  # list to store all embeddings

for text in corpa:
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"

    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)

    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    segments_ids = [1] * len(tokenized_text)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():

        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on 
        # how it's  configured in the `from_pretrained` call earlier. In this case, 
        # because we set `output_hidden_states = True`, the third item will be the 
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]


    # `hidden_states` has shape [13 x 1 x <sentence length> x 768]

    # `token_vecs` is a tensor with shape [<sentence length> x 768],
    # taken from the second-to-last hidden layer.
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all the token vectors in the sentence.
    sentence_embedding = torch.mean(token_vecs, dim=0)

    storage.append((text,sentence_embedding))

Update 1

I modified my code based upon the answer provided, but it is still not doing full batch processing.

#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

batch_sentences = ["Hello I'm a single sentence",
                    "And another sentence",
                    "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)


storage = []  # list to store all embeddings
for i,text in enumerate(encoded_inputs['input_ids']):
    
    tokens_tensor = torch.tensor([encoded_inputs['input_ids'][i]])
    segments_tensors = torch.tensor([encoded_inputs['attention_mask'][i]])
    print (tokens_tensor)
    print (segments_tensors)

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():

        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on 
        # how it's  configured in the `from_pretrained` call earlier. In this case, 
        # because we set `output_hidden_states = True`, the third item will be the 
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]


    # `hidden_states` has shape [13 x 1 x <sentence length> x 768]

    # `token_vecs` is a tensor with shape [<sentence length> x 768],
    # taken from the second-to-last hidden layer.
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all the token vectors in the sentence.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    print (sentence_embedding[:10])
    storage.append((text,sentence_embedding))

I could update the first two lines inside the for loop to the ones below, but they work only if all sentences have the same length after tokenization:

tokens_tensor = torch.tensor([encoded_inputs['input_ids']])
segments_tensors = torch.tensor([encoded_inputs['attention_mask']])

Moreover, in that case outputs = model(tokens_tensor, segments_tensors) fails.
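
A padded batch (just a sketch on my side, assuming the tokenizer's padding=True and return_tensors='pt' options behave the way I expect) would at least give equally sized tensors, but I am not sure it is the right way to do this:

encoded_inputs = tokenizer(batch_sentences, padding=True, return_tensors='pt')
tokens_tensor = encoded_inputs['input_ids']        # shape: [batch_size, longest_sentence_in_batch]
segments_tensors = encoded_inputs['attention_mask']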

How can I fully perform batch processing in such a case?

user2543622

2 Answers


One of the easiest ways to accelerate your workflow is batched data processing. In your current implementation you are feeding ONLY one sentence per iteration, but the model is perfectly capable of consuming batched data!

Now, if you are willing to implement this part yourself, I highly recommend preparing your data by calling the tokenizer on the whole batch, like this:

batch_sentences = ["Hello I'm a single sentence",
                    "And another sentence",
                    "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}

But there is a simpler approach: the FeatureExtractionPipeline, which comes with comprehensive documentation. It would look like this:

from transformers import pipeline

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction(["Hello I'm a single sentence",
                               "And another sentence",
                               "And the very very last one"])

UPDATE 1: In fact, you changed your code slightly, but you are still passing the samples one at a time rather than in batch form. If we want to stick to your implementation, batch processing would look something like this:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
model.eval()
sentences = [ 
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
            ]
batch_size = 4  
for idx in range(0, len(sentences), batch_size):
    batch = sentences[idx : min(len(sentences), idx+batch_size)]
    
    # encoded = tokenizer(batch)
    # Tokenize the whole batch at once; every sentence is padded/truncated to max_length.
    encoded = tokenizer.batch_encode_plus(batch, max_length=50, padding='max_length', truncation=True)

    # Convert the python lists returned by the tokenizer into LongTensors.
    encoded = {key: torch.LongTensor(value) for key, value in encoded.items()}
    with torch.no_grad():
        
        outputs = model(**encoded)
        
    
    print(outputs.last_hidden_state.size())

output:

torch.Size([4, 50, 768]) # batch_size * max_length * hidden dim
torch.Size([4, 50, 768])
torch.Size([1, 50, 768]) 

UPDATE 2

Two questions have come up about padding the batched data to the maximum length. First, can the padding distract the transformer model with irrelevant information? NO, because during the training phase the model was presented with variable-length input sentences in batched form, and the designers introduced a specific parameter, the attention mask, to tell the model WHERE it should attend! Second, how can you get rid of this garbage data? Using the attention mask you can perform the mean operation over the relevant tokens only!

So the code would be changed to something like this:

for idx in range(0, len(sentences), batch_size):
    batch = sentences[idx : min(len(sentences), idx+batch_size)]
    
    # encoded = tokenizer(batch)
    encoded = tokenizer.batch_encode_plus(batch,max_length=50, padding='max_length', truncation=True)
  
    encoded = {key:torch.LongTensor(value) for key, value in encoded.items()}
    with torch.no_grad():
        
        outputs = model(**encoded)
    lhs = outputs.last_hidden_state  # [batch_size, max_length, 768]
    # Broadcast the attention mask to the hidden dimension so padded positions are zeroed out.
    attention = encoded['attention_mask'].reshape((lhs.size()[0], lhs.size()[1], -1)).expand(-1, -1, 768)
    embeddings = torch.mul(lhs, attention)
    # Average over the non-padded tokens only.
    denominator = torch.count_nonzero(embeddings, dim=1)
    summation = torch.sum(embeddings, dim=1)
    mean_embeddings = torch.div(summation, denominator)
meti
  • There is an important caveat to consider, though: When using pipelines (or generally batched inputs), the outputs will have the *length of your longest input sequence*. Especially when averaging, this means you also average over "irrelevant" tokens which should ideally be ignored! I find this not well documented in the pipelines, so it isn't obvious that this padding occurs at all... – dennlinger Oct 11 '21 at 10:06
  • I didn't know any of these, thank you! @dennlinger – meti Oct 11 '21 at 10:17
  • Sorry for another update, but I dug a bit deeper (the code in the main repository did not give me a clear indication where it would be padded), and it seems that **this behavior changed with version 4.11**. In fact, now the model will process each sample individually, which will result in the original length being preserved (expected behavior in your answer). – dennlinger Oct 11 '21 at 10:47
  • @meti: Your update faces exactly the problem mentioned by @dennlinger. – cronoik Oct 16 '21 at 16:48
  • @cronoik did you try the last one? – meti Oct 19 '21 at 10:18

About your original question: there is not much you can do. BERT is a pretty computationally demanding algorithm. Your best shot is to use BertTokenizerFast instead of the regular BertTokenizer. The "fast" version is much more efficient, and you will see the difference on large amounts of text.
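
As a sketch of that swap (same checkpoint, only the tokenizer class changes; the example sentences are just placeholders):

from transformers import BertTokenizerFast

# Sketch: the fast (Rust-backed) tokenizer is a drop-in replacement for BertTokenizer here.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
encoded = tokenizer(["i am a boy", "i live in a city"],
                    padding=True, truncation=True, return_tensors='pt')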

That said, I have to warn you that averaging BERT word embeddings does not create good sentence embeddings. See this post. From your questions I assume you want to do some kind of semantic similarity search. Try using one of those open-sourced models.
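
For example, a minimal sketch with the sentence-transformers library (the model name is only an illustration, any pretrained SBERT checkpoint works the same way):

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # example model name, pick any pretrained SBERT model
embeddings = model.encode(["i am a boy", "i live in a city"],
                          batch_size=64, show_progress_bar=True)
print(embeddings.shape)  # (num_sentences, embedding_dim)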

igrinis
  • is there a way to finetune sentence-bert models on my data? – user2543622 Oct 19 '21 at 16:07
  • yes you can fine-tune, all you need are relevant sentence pairs. Take a look at the training utilities from the sentence-transformers library; the most performant approach is using [multiple negatives ranking loss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) on NLI data, I'd give that a go – James Briggs Oct 19 '21 at 17:37
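
A rough sketch of that fine-tuning setup with multiple negatives ranking loss (the base model name and the sentence pairs below are placeholders; in practice you would feed it related sentence pairs, e.g. from NLI data, as suggested in the comment above):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')  # example base model
# Placeholder pairs: each InputExample holds two sentences that should end up close in embedding space.
train_examples = [
    InputExample(texts=["i am a boy", "i am a male child"]),
    InputExample(texts=["i live in a city", "my home is in a city"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)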