
I'm trying to tokenize my dataset with the following preprocessing function. I've already downloaded the tokenizer with AutoTokenizer from the Spanish BERT checkpoint.

```python
max_input_length = 280
max_target_length = 280
source_lang = "es"
target_lang = "en"
prefix = "translate spanish_to_women to spanish_to_men: "

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["mujeres_tweet"]]
    targets = [ex for ex in examples["hombres_tweet"]]

    model_inputs = tokz(inputs,
                        padding=True, 
                        truncation=True,
                        max_length=max_input_length,
                        return_tensors = 'pt'
                        )

    # Setup the tokenizer for targets
    with tokz.as_target_tokenizer():
        labels = tokz(targets, 
                      padding=True, 
                      truncation=True,
                      max_length=max_target_length,
                      return_tensors = 'pt'
                      )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

```
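
For reference, `tokz` is the tokenizer loaded earlier with AutoTokenizer. It was created along these lines (the checkpoint name below is just an example of a Spanish BERT checkpoint, not necessarily the exact one I downloaded):

```python
from transformers import AutoTokenizer

# Example Spanish BERT checkpoint name; substitute whichever checkpoint was actually downloaded
tokz = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
```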

And I get an error (shown in a screenshot) when trying to pass my dataset object through the function.

I've already tried dropping the columns that contain strings (roughly as sketched below). I've also seen that when I do not set return_tensors it does tokenize my dataset, but later on I hit the same problem when trying to train my BERT model. Does anyone know what might be going on?
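
The column-dropping attempt was roughly this (a sketch; mujeres_tweet and hombres_tweet are the string columns in my dataset):

```python
# Same preprocess_function as above (still with return_tensors='pt'),
# but dropping the original string columns during map
tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=["mujeres_tweet", "hombres_tweet"],
)
```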

I've also tried tokenizing without return_tensors and then calling set_format, but that returns an empty dataset object.
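
That attempt looked roughly like this (a sketch from memory; the column names passed to set_format are my best guess):

```python
# Tokenize without return_tensors='pt' (removed from preprocess_function for this attempt),
# then ask the dataset to return PyTorch tensors
tokenized_no_tensors = raw_datasets.map(preprocess_function, batched=True)
tokenized_no_tensors.set_format(type="torch",
                                columns=["input_ids", "attention_mask", "labels"])
```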

(My dataset and an example of the inputs were included as screenshots.)

So that I just do:

```python
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
```