I'm trying to use SimpleTransformers' default setup to do multi-task learning, following the example from their website here.
The code looks like this:
import logging
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
train_data = [
    ["binary classification", "Anakin was Luke's father", 1],
    ["binary classification", "Luke was a Sith Lord", 0],
    ["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
    ["generate question", "Anakin was Luke's father", "Who was Luke's father?"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["prefix", "input_text", "target_text"]
eval_data = [
    ["binary classification", "Leia was Luke's sister", 1],
    ["binary classification", "Han was a Sith Lord", 0],
    ["generate question", "In 2020, the Star Wars franchise's total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time.", "What is the total value of the Star Wars franchise?"],
    ["generate question", "Leia was Luke's sister", "Who was Luke's sister?"],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["prefix", "input_text", "target_text"]
model_args = T5Args()
model_args.num_train_epochs = 200
model_args.no_save = True
model_args.evaluate_generated_text = False
model_args.evaluate_during_training = False
model_args.evaluate_during_training_verbose = False
model_args.use_multiprocessing = False
model_args.use_multiprocessing_for_evaluation = False
model = T5Model("t5", "t5-base", args=model_args)
def count_matches(labels, preds):
    print(labels)
    print(preds)
    return sum([1 if label == pred else 0 for label, pred in zip(labels, preds)])
model.train_model(train_df, show_running_loss=True)
I'm not using eval_df at the moment (though I plan to in my real code) because it wasn't set up properly in their example.
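For reference, when I do wire up evaluation later, I expect the call to look roughly like this, going by their docs (a sketch only, not part of the failing run):
# Sketch based on the SimpleTransformers docs -- not what I'm currently running.
model_args.evaluate_generated_text = True
model_args.evaluate_during_training = True
model.train_model(train_df, eval_data=eval_df, matches=count_matches)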
With a setup this simple, I would expect the library to just work. However, after trying it on two systems (one Windows, one Linux, both on the latest version of SimpleTransformers), I get the following error:
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\site-packages\simpletransformers\t5\t5_utils.py", line 175, in <listcomp>
preprocess_data(d) for d in tqdm(data, disable=args.silent)
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\site-packages\simpletransformers\t5\t5_utils.py", line 81, in preprocess_data
batch = tokenizer.prepare_seq2seq_batch(
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\tokenization_utils_base.py", line 3282, in prepare_seq2seq_batch
labels = self(
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\tokenization_utils_base.py", line 2262, in __call__
raise ValueError(
ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
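To narrow things down, I tried calling the tokenizer directly (a minimal repro sketch; I'm assuming the model loads the standard t5-base tokenizer):
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
tokenizer("Anakin was Luke's father")  # works as expected
tokenizer(1)  # raises the same ValueError about text input types
So passing a non-string to the tokenizer reproduces the error, but I don't see where a non-string would be coming from.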
I'm using the exact setup from the example, and as far as I can tell the input DataFrames have strings in them.
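If it helps, this is how I would double-check what types actually end up in the DataFrames (a quick sketch):
# Sanity check: print the pandas dtypes and the per-cell Python types.
print(train_df.dtypes)
print(train_df.applymap(lambda v: type(v).__name__))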
Can anyone help me figure out why this basic setup fails? Thanks.