I'm trying to tokenize my dataset with the following preprocessing function. I've already downloaded the tokenizer with AutoTokenizer from the Spanish BERT checkpoint.
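In case it matters, this is roughly how I load it (the checkpoint name below is just the Spanish BERT model I'm using as an example):

```python
from transformers import AutoTokenizer

# Spanish BERT checkpoint shown as an example of what I'm loading
tokz = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
```

And this is the preprocessing function: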
```python
max_input_length = 280
max_target_length = 280
source_lang = "es"
target_lang = "en"
prefix = "translate spanish_to_women to spanish_to_men: "

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["mujeres_tweet"]]
    targets = [ex for ex in examples["hombres_tweet"]]
    model_inputs = tokz(
        inputs,
        padding=True,
        truncation=True,
        max_length=max_input_length,
        return_tensors="pt",
    )
    # Setup the tokenizer for targets
    with tokz.as_target_tokenizer():
        labels = tokz(
            targets,
            padding=True,
            truncation=True,
            max_length=max_target_length,
            return_tensors="pt",
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```
And I get the following error when trying to pass my dataset object through the function.
I've already tried dropping the columns that have strings. I've also seen that when I don't set return_tensors, it does tokenize my dataset (but later on I run into the same problem when trying to train my BERT model). Does anyone know what might be going on? *inserts crying face*
Also, I've tried tokenizing it without return_tensors and then doing set_format, but it returns an empty dataset object *inserts another crying face*.
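To be concrete, that attempt looked roughly like this (the column names are just the ones my preprocessing function produces):

```python
# Tokenize without return_tensors, then ask datasets for torch tensors afterwards
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
tokenized_datasets.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "labels"],
)
```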
My dataset looks like the following:
So I just do:

```python
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
```
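And when I say I tried dropping the string columns, I mean something along these lines (the column list is just the raw text columns in my dataset):

```python
# Variant I also tried: drop the raw text columns while mapping
tokenized_datasets = raw_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=["mujeres_tweet", "hombres_tweet"],
)
```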