
I'm using a BERT tokenizer over a large dataset of sentences (2.3M lines, 6.53bn words):

# creating a BERT tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

# encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].comment.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

As-is, it runs on the CPU and only on one core. I tried to parallelize, but that would at best speed up processing 16x on my 16-core CPU, which would still take ages for the full dataset.
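
For context, my parallelization attempt was along these lines (a rough sketch, not my exact code; the chunk size of 10,000 is arbitrary, and this assumes Linux fork semantics so the workers inherit the tokenizer):

from multiprocessing import Pool

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

def encode_chunk(texts):
    # each worker tokenizes its own slice of the dataset
    return tokenizer.batch_encode_plus(
        list(texts),
        add_special_tokens=True,
        return_attention_mask=True,
        pad_to_max_length=True,
        max_length=256,
        return_tensors='pt'
    )

texts = df[df.data_type == 'train'].comment.values
chunks = [texts[i:i + 10000] for i in range(0, len(texts), 10000)]

with Pool(16) as pool:  # one worker per core
    encoded_chunks = pool.map(encode_chunk, chunks)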

Is there any way to make it run on GPU, or to speed this up some other way?

EDIT: I have also tried using a fast tokenizer:

# creating a BERT tokenizer
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased',
                                              do_lower_case=True)

Then passing my data to its batch_encode_plus:

# encoding the data using our tokenizer
encoded_dict = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].comment.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

But batch_encode_plus raises the following error:

TypeError: batch_text_or_text_pairs has to be a list (got <class 'numpy.ndarray'>)
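
Presumably the fast tokenizer insists on a plain Python list rather than a NumPy array, so converting the array first should get past this particular error. An untested sketch:

# convert the NumPy array to a plain Python list before encoding
texts = df[df.data_type == 'train'].comment.values.tolist()

encoded_dict = tokenizer.batch_encode_plus(
    texts,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)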

Vincent Teyssier
    Have you tried the [BertTokenizerFast](https://huggingface.co/transformers/model_doc/bert.html#berttokenizerfast)? – cronoik Jan 23 '21 at 14:58
  • @cronoik is it only a different implementation of the tokenizer or would I get the same output? – Vincent Teyssier Jan 24 '21 at 09:37
  • You will get the same output, plus some extra features like [char_to_token](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding.char_to_token) (see the sketch after these comments). They consume more memory but are much faster. – cronoik Jan 24 '21 at 12:46
  • Just tried it, and it doesn't seem like batch_encode_plus accepts the fast tokenizer's output – Vincent Teyssier Jan 26 '21 at 11:57
  • Not sure if this is just a mistake, but do you mean that the fast tokenizer produces a different output? Please give me a reproducible example by editing your question. – cronoik Jan 26 '21 at 13:12
  • Sorry, yes, I meant it does not accept the fast tokenizer's output; editing the question now – Vincent Teyssier Jan 26 '21 at 15:30
  • Can you please give us a small artificial example of `df`? – cronoik Jan 26 '21 at 16:27
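
For reference, a minimal sketch of the char_to_token feature mentioned in the comments (available on fast tokenizers only):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

encoding = tokenizer("Hello world")
# map character offset 6 (the 'w' in "world") back to its token index
print(encoding.char_to_token(6))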
