
I'm trying to tokenize my dataset with the following preprocessing function. I've already downloaded the tokenizer with AutoTokenizer from the Spanish BERT checkpoint.

```python
max_input_length = 280
max_target_length = 280
source_lang = "es"
target_lang = "en"
prefix = "translate spanish_to_women to spanish_to_men: "

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["mujeres_tweet"]]
    targets = [ex for ex in examples["hombres_tweet"]]

    model_inputs = tokz(inputs,
                        padding=True, 
                        truncation=True,
                        max_length=max_input_length,
                        return_tensors = 'pt'
                        )

    # Setup the tokenizer for targets
    with tokz.as_target_tokenizer():
        labels = tokz(targets, 
                      padding=True, 
                      truncation=True,
                      max_length=max_target_length,
                      return_tensors = 'pt'
                      )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

```
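
For reference, `tokz` is the tokenizer loaded earlier with AutoTokenizer. It was created along these lines (the checkpoint name below is just an example of a Spanish BERT checkpoint, not necessarily the exact one I downloaded):

```python
from transformers import AutoTokenizer

# Example Spanish BERT checkpoint name; substitute whichever checkpoint was actually downloaded
tokz = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
```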

And I get an error (shown in a screenshot) when trying to pass my dataset object through the function.

I've already tried dropping the columns that contain strings (roughly as sketched below). I've also seen that when I do not set return_tensors it does tokenize my dataset, but later on I hit the same problem when trying to train my BERT model. Does anyone know what might be going on?
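
The column-dropping attempt was roughly this (a sketch; mujeres_tweet and hombres_tweet are the string columns in my dataset):

```python
# Same preprocess_function as above (still with return_tensors='pt'),
# but dropping the original string columns during map
tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=["mujeres_tweet", "hombres_tweet"],
)
```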

I've also tried tokenizing without return_tensors and then calling set_format, but that returns an empty dataset object.
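
That attempt looked roughly like this (a sketch from memory; the column names passed to set_format are my best guess):

```python
# Tokenize without return_tensors='pt' (removed from preprocess_function for this attempt),
# then ask the dataset to return PyTorch tensors
tokenized_no_tensors = raw_datasets.map(preprocess_function, batched=True)
tokenized_no_tensors.set_format(type="torch",
                                columns=["input_ids", "attention_mask", "labels"])
```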

(My dataset and an example of the inputs were included as screenshots.)

So that I just do:

```python
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
```