
I'm using a BERT tokenizer over a large dataset of sentences (2.3M lines, 6.53bn words):

# creating a BERT tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

# encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].comment.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

As-is, it runs on the CPU and only on one core. I tried to parallelize, but that would at best speed up processing 16x on my 16-core CPU, which would still take ages for the full dataset.
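
For context, my parallelization attempt was along these lines (a rough sketch, not my exact code; the chunk size of 10,000 is arbitrary, and this assumes Linux fork semantics so the workers inherit the tokenizer):

from multiprocessing import Pool

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

def encode_chunk(texts):
    # each worker tokenizes its own slice of the dataset
    return tokenizer.batch_encode_plus(
        list(texts),
        add_special_tokens=True,
        return_attention_mask=True,
        pad_to_max_length=True,
        max_length=256,
        return_tensors='pt'
    )

texts = df[df.data_type == 'train'].comment.values
chunks = [texts[i:i + 10000] for i in range(0, len(texts), 10000)]

with Pool(16) as pool:  # one worker per core
    encoded_chunks = pool.map(encode_chunk, chunks)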

Is there any way to make it run on GPU, or to speed this up some other way?

EDIT: I have also tried using a fast tokenizer:

# creating a BERT tokenizer
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased',
                                              do_lower_case=True)

Then passing my data to its batch_encode_plus:

# encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].comment.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

But batch_encode_plus raises the following error:

TypeError: batch_text_or_text_pairs has to be a list (got <class 'numpy.ndarray'>)
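
Presumably the fast tokenizer insists on a plain Python list rather than a NumPy array, so converting the array first should get past this particular error. An untested sketch:

# convert the NumPy array to a plain Python list before encoding
texts = df[df.data_type == 'train'].comment.values.tolist()

encoded_dict = tokenizer.batch_encode_plus(
    texts,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)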

Vincent Teyssier
    Have you tried the [BertTokenizerFast](https://huggingface.co/transformers/model_doc/bert.html#berttokenizerfast)? – cronoik Jan 23 '21 at 14:58
  • @cronoik is it only a different implementation of the tokenizer or would I get the same output? – Vincent Teyssier Jan 24 '21 at 09:37
  • You will get the same output, plus some extra features like [char_to_token](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding.char_to_token) (see the sketch after these comments). They consume more memory but are much faster. – cronoik Jan 24 '21 at 12:46
  • Just tried it, and it doesn't seem like batch_encode_plus accepts the fast tokenizer's output – Vincent Teyssier Jan 26 '21 at 11:57
  • Not sure if this is just a mistake, but do you mean that the fast tokenizer produces a different output? Please give me a reproducible example by editing your question. – cronoik Jan 26 '21 at 13:12
  • Sorry, yes, I meant it does not accept the fast tokenizer's output; editing the question now – Vincent Teyssier Jan 26 '21 at 15:30
  • Can you please give us a small artificial example of `df`? – cronoik Jan 26 '21 at 16:27
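
For reference, a minimal sketch of the char_to_token feature mentioned in the comments (available on fast tokenizers only):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

encoding = tokenizer("Hello world")
# map character offset 6 (the 'w' in "world") back to its token index
print(encoding.char_to_token(6))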
