2

I'm dealing with a huge text dataset for content classification. I've implemented the DistilBERT model with the DistilBertTokenizer.from_pretrained() tokenizer. Tokenizing my text data takes incredibly long, roughly 7 minutes for just 14k records, and that's because it runs on my CPU.

Is there any way to force the tokenizer to run on my GPU?

tehem
  • This seems to be a duplicate of [this question](https://stackoverflow.com/questions/65857708/is-there-a-way-to-use-gpu-instead-of-cpu-for-bert-tokenization). – justanyphil Feb 08 '21 at 06:38

1 Answer


Tokenization is string manipulation. It is basically a for loop over a string with a bunch of if-else conditions and dictionary lookups. There is no way this could be sped up using a GPU. Basically, the only thing a GPU can do is tensor multiplication and addition. Only problems that can be formulated using tensor operations can be accelerated on a GPU.
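To see why, here is a toy illustration (a hypothetical whitespace-plus-vocabulary-lookup tokenizer, not HuggingFace's actual algorithm): everything is branching and dictionary lookups, with no tensor math for a GPU to accelerate.

```python
# Toy tokenizer: a for loop over a string with dict lookups and if-else
# branches -- nothing here maps onto tensor multiplication or addition.
vocab = {"[UNK]": 0, "huge": 1, "text": 2, "dataset": 3}

def toy_tokenize(sentence):
    ids = []
    for word in sentence.lower().split():
        if word in vocab:          # dictionary lookup
            ids.append(vocab[word])
        else:                      # branch for out-of-vocabulary words
            ids.append(vocab["[UNK]"])
    return ids

print(toy_tokenize("Huge TEXT corpus"))  # "corpus" is OOV -> [1, 2, 0]
```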

The default tokenizers in Huggingface Transformers are implemented in Python. There is a faster version implemented in Rust. You can get it either from the standalone package Huggingface Tokenizers or, in newer versions of Transformers, under DistilBertTokenizerFast.

Jindřich
  • Thank you. I've implemented the fast tokenizer and the performance has increased dramatically. – tehem Feb 09 '21 at 10:12
  • I'm a little confused, https://huggingface.co/docs/tokenizers/python/latest/ states it's got a fast rust implementation, does that mean the python package will use this rust implementation by default? Or perhaps @tehem you could elaborate on what you mean by you implemented the faster tokenizer ? I might be missing something – Kevin Danikowski Aug 21 '21 at 19:25
  • How can we use it in multiple CPU cores? Here it is using a single core. – ton Jul 14 '23 at 13:34
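Regarding the multi-core question above: fast (Rust-backed) tokenizers already parallelize batched encoding across cores (controllable via the TOKENIZERS_PARALLELISM environment variable). For a slow tokenizer, one generic option is to shard the records across processes with the standard library; the sketch below uses a stand-in `tokenize` function, not a Transformers API.

```python
from multiprocessing import Pool

def tokenize(text):
    # Stand-in for a real per-record tokenization call,
    # e.g. a slow HuggingFace tokenizer applied to one string.
    return text.lower().split()

if __name__ == "__main__":
    records = ["First record", "Second record", "Third record"]
    with Pool(processes=4) as pool:  # up to 4 worker processes
        tokenized = pool.map(tokenize, records)
    print(tokenized[0])
```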