I must be missing something ...
I want to use a pretrained model with HuggingFace:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

transformer_name = "Geotrend/distilbert-base-fr-cased"  # Or whatever model
model = AutoModelForSequenceClassification.from_pretrained(transformer_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(transformer_name)
Now that I have my model and my tokenizer, I need to tokenize my dataset, but I don't know which parameters (padding, truncation, max_length) to use with my tokenizer.
Some examples just call the tokenizer:

tokenizer(data)

others use truncation only:

tokenizer(data, truncation=True)

and others pass many parameters:

tokenizer(data, padding=True, truncation=True, return_tensors='pt', max_length=512)
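To see what each variant actually does, I ran a quick comparison (the French sentences are just dummy data I made up):

sentences = ["Une phrase courte.", "Une phrase beaucoup plus longue, pour voir la différence de longueur."]

enc_plain = tokenizer(sentences)                   # dict of lists, lengths can differ
enc_trunc = tokenizer(sentences, truncation=True)  # same, but cut at some maximum length
enc_full = tokenizer(sentences, padding=True, truncation=True,
                     return_tensors='pt', max_length=512)
print(len(enc_plain['input_ids'][0]), len(enc_plain['input_ids'][1]))
print(enc_full['input_ids'].shape)  # one padded tensor: (2, longest_sequence_length)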
As I am reloading a pretrained tokenizer, I would love for it to use the same parameters as in the original training process. How do I know which parameters to use?
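The only hint I have found so far is to inspect the reloaded tokenizer itself, but I am not sure these attributes really reflect the original training setup:

print(tokenizer.model_max_length)  # 512 for this DistilBERT-based model, I believe
print(tokenizer.padding_side)      # usually 'right'
print(tokenizer.init_kwargs)       # the kwargs the tokenizer was saved with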
My understanding is that I should always truncate my data and leave max_length at None, so that my sequence lengths never exceed the model's maximum length. Is that right? Does leaving max_length at None make it fall back on the model's maximum length?
And what should I do with padding? As I am using a Trainer object for training with a DataCollatorWithPadding, should I set padding to False to reduce the memory impact and let the collator pad my batches?
Final question: what should I do if I use a TextClassificationPipeline for inference? Should I specify these parameters (padding, etc.)? Will the pipeline handle it for me?
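For instance, I don't know whether the plain call below is enough, or whether I have to forward the tokenizer parameters myself at call time (I am not even sure every version of transformers accepts them there):

from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)
print(pipe("Un exemple de phrase à classer."))

# ... or explicitly, in case the pipeline does not truncate by default:
print(pipe("Un exemple de phrase à classer.", truncation=True))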