I am new to tokenizers. My understanding is that the truncation argument just cuts the sentence off, but I need the whole sentence for context.

For example, my sentence is:

"Ali bin Abbas'ın  Kitab Kamilü-s Sina adlı eseri daha sonra 980 yılında nasıl adlandırılmıştır?  Ali bin Abbas'ın eseri Rezi'nin hangi isimli eserinden daha özlü ve daha sistematikdir?  Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri İbn-i Sina'nın hangi isimli eserinden daha uygulamalı bir biçimde yazılmıştır? Kitab el-Maliki Avrupa'da Constantinus Africanus tarafından hangi dile çevrilmiştir? Kitab el-Maliki'nin ilk bölümünde neye ağırlık verilmiştir?

But when I use max_length=64, truncation=True and pad_to_max_length=True for my encoder (as suggested on the internet), half of the sentence is gone:

['▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '-', 's', '▁Sina', '▁ad', 'lı', '▁es', 'eri', '▁daha', '▁sonra', '▁980', '▁yıl', 'ında', '▁na', 'sıl', '▁adlandır', 'ılmıştır', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁es', 'eri', '▁Rez', 'i', "'", 'nin', '▁', 'hangi', '▁is', 'imli', '▁es', 'erinden', '▁daha', '▁', 'özlü', '▁ve', '▁daha', '▁sistema', 'tik', 'dir', '?', '▁', '<sep>', '▁Ali', '▁bin', '▁Abbas', "'", 'ın', '▁Kitab', '▁Kami', 'lü', '</s>']

And when I increase max_length, CUDA runs out of memory, of course. What should my approach be for long texts in the dataset?

My code for encoding:

input_encodings = tokenizer.batch_encode_plus(
    example_batch['context'], 
    max_length=512, 
    add_special_tokens=True,
    truncation=True, 
    pad_to_max_length=True)

target_encodings = tokenizer.batch_encode_plus(
    example_batch['questions'],
    max_length=64, 
    add_special_tokens=True,
    truncation=True,
    pad_to_max_length=True)
canovichh

1 Answer

Yes, the truncation argument just keeps the given number of subwords, counted from the left. The workaround depends on the task you are solving and on the data that you use.
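For illustration, here is a minimal sketch of that behavior (assuming a Hugging Face fast tokenizer; the checkpoint name and the text are just placeholders): only the first max_length subwords survive, the rest is dropped.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('google/mt5-small')  # placeholder checkpoint

long_text = "Ali bin Abbas'ın Kitab Kamilü-s Sina adlı eseri ..."  # your long context
enc = tokenizer(long_text, max_length=64, truncation=True, padding='max_length')

print(len(enc['input_ids']))                                    # always exactly 64
print(tokenizer.convert_ids_to_tokens(enc['input_ids'])[:10])   # only the beginning of the text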

Are long sequences frequent in your data? If not, you can safely throw those instances away, because it is unlikely that the model would learn to generalize to long sequences anyway.

If you really need the long context, you have plenty of options:

  • Decrease the batch size (and perhaps accumulate gradients over several batches before each update; see the first sketch after this list).
  • Make the model smaller: use either a smaller dimension or fewer layers.
  • Use a different architecture: Transformers need memory that grows quadratically with sequence length. Wouldn't an LSTM or a CNN do the job? Or consider architectures designed for long sequences (e.g., Reformer, Longformer).
  • If you need to use a pre-trained BERT-like model and there is no model of a size that fits your needs, you can distill a smaller model (or a model with a more suitable architecture) yourself.
  • Perhaps you can split the input. In tasks like answer span selection, you can split the text in which you are looking for an answer into smaller chunks and search the chunks independently (see the second sketch after this list).
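Regarding the first bullet, here is a rough, self-contained sketch of gradient accumulation in plain PyTorch; the toy model, optimizer and data are just stand-ins for whatever you actually train.

import torch

# toy stand-ins; replace with your model, optimizer and DataLoader
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(12)]
loss_fn = torch.nn.CrossEntropyLoss()

accumulation_steps = 4                     # effective batch size = 4 * 8
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    loss = loss_fn(model(inputs), labels) / accumulation_steps  # scale so the sum matches one large batch
    loss.backward()                        # gradients add up across the small batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                   # one update per accumulation_steps batches
        optimizer.zero_grad()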
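And for the last bullet, a sketch of letting the tokenizer do the splitting for you; return_overflowing_tokens and stride are standard arguments of the fast Hugging Face tokenizers, and tokenizer/example_batch are the ones from your question (the exact numbers are placeholders):

chunked = tokenizer(
    example_batch['context'],
    max_length=512,
    truncation=True,
    stride=128,                        # overlap between consecutive chunks
    return_overflowing_tokens=True,    # keep every chunk instead of dropping the tail
    padding='max_length')

# chunked['overflow_to_sample_mapping'] maps each chunk back to the original example,
# so you can run the model on the chunks independently and merge the predictions afterwards.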
Jindřich