I must be missing something ...
I want to use a pretrained model with HuggingFace:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

transformer_name = "Geotrend/distilbert-base-fr-cased"  # Or whatever model
model = AutoModelForSequenceClassification.from_pretrained(transformer_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(transformer_name)
Now that I have my model and my tokenizer, I need to tokenize my dataset, but I don't know which parameters (padding, truncation, max_length) to use with my tokenizer.
Some examples just call the tokenizer:

tokenizer(data)

others use truncation only:

tokenizer(data, truncation=True)

and others pass many parameters:

tokenizer(data, padding=True, truncation=True, return_tensors='pt', max_length=512)
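To see what each variant actually does, I ran a quick comparison (the French sentences are just dummy data I made up):

sentences = ["Une phrase courte.", "Une phrase beaucoup plus longue, pour voir la différence de longueur."]

enc_plain = tokenizer(sentences)                   # dict of lists, lengths can differ
enc_trunc = tokenizer(sentences, truncation=True)  # same, but cut at some maximum length
enc_full = tokenizer(sentences, padding=True, truncation=True,
                     return_tensors='pt', max_length=512)
print(len(enc_plain['input_ids'][0]), len(enc_plain['input_ids'][1]))
print(enc_full['input_ids'].shape)  # one padded tensor: (2, longest_sequence_length)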
As I am reloading a pretrained tokenizer, I would love for it to use the same parameters as in the original training process. How do I know which parameters to use?
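The only hint I have found so far is to inspect the reloaded tokenizer itself, but I am not sure these attributes really reflect the original training setup:

print(tokenizer.model_max_length)  # 512 for this DistilBERT-based model, I believe
print(tokenizer.padding_side)      # usually 'right'
print(tokenizer.init_kwargs)       # the kwargs the tokenizer was saved with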
My understanding is that I should always truncate my data and leave max_length at None, so that my sequence lengths never exceed the model's maximum length. Is that right? Does leaving max_length at None make it fall back on the model's maximum length?
And what should I do with padding? As I am using a Trainer object for training with a DataCollatorWithPadding, should I set padding to False to reduce the memory impact and let the collator pad my batches?
Final question: what should I do if I use a TextClassificationPipeline for inference? Should I specify these parameters (padding, etc.)? Will the pipeline handle it for me?
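For instance, I don't know whether the plain call below is enough, or whether I have to forward the tokenizer parameters myself at call time (I am not even sure every version of transformers accepts them there):

from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)
print(pipe("Un exemple de phrase à classer."))

# ... or explicitly, in case the pipeline does not truncate by default:
print(pipe("Un exemple de phrase à classer.", truncation=True))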