
First off, I'll start by saying that I'm a beginner when it comes to machine learning as a whole and to transformers, so my apologies if this is a dumb question. I've been fine-tuning T5 for the task of generating MongoDB queries, but I was met with a strange output that doesn't look like the intended one.

inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")

# Beam search with sampling enabled
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=64)

# Decode without skipping special tokens so <pad>, <unk> and </s> stay visible
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=False)[0]
print(decoded_output)

predicted_query = nltk.sent_tokenize(decoded_output.strip())[0]
print(predicted_query)

This gives the following output (both print statements produce the same string, since the whole query is parsed as a single sentence):

<pad> db.movies.find(<unk>"title": "The Poor Little Rich Girl"<unk>, <unk>"writers": 1<unk>)</s>

<pad> db.movies.find(<unk>"title": "The Poor Little Rich Girl"<unk>, <unk>"writers": 1<unk>)</s>

The query is correct for the most part. I assume the <unk> tokens are supposed to be curly braces, but the model wasn't able to represent them (an out-of-vocabulary case). Note that the dataset used to fine-tune it contains curly braces in the outputs, so I'm confused about how it couldn't recognize them during testing. Would it be a problem with the tokenizer? If that's the case, could I expand the vocab by adding some new tokens? I'm not asking for an answer (although one is welcome), but some guidance would be appreciated. Thank you for your time.

I tested whether the tokenizer can handle curly braces and it appeared to. Again, I'm new to this, so I'm not really sure I understand the problem well.
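(A minimal way to run that check, assuming the stock Hugging Face t5-base tokenizer, would be something like the following; convert_ids_to_tokens makes any <unk> substitutions visible:)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Encode a string containing braces and inspect the resulting tokens;
# characters missing from the vocab come back as <unk>
ids = tokenizer('{"title": "The Poor Little Rich Girl"}').input_ids
print(tokenizer.convert_ids_to_tokens(ids))
print(tokenizer.unk_token_id in ids)  # True means at least one piece was out-of-vocabulary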

  • Did you use a trained tokenizer (if yes, which one)? What is the output of `tokenizer('text with { and } because we want to { know their } ids').input_ids` – cronoik Mar 27 '23 at 20:23
  • @cronoik I've used the T5 tokenizer provided by the Hugging Face library. I figured out the problem: the T5 tokenizer doesn't recognize some characters like curly braces, so I had to expand the vocab. – zaki Miho Mar 27 '23 at 23:01

1 Answer


After some research I found a solution: the T5 tokenizer's vocab is missing a few characters, curly braces among them, so I used the following to add them.

from transformers import AutoModel, AutoTokenizer

new_words = ['{', '}']

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModel.from_pretrained("t5-base")

# Register the missing characters as new tokens
tokenizer.add_tokens(new_words)

# Grow the embedding matrix so each new token gets an embedding row
model.resize_token_embeddings(len(tokenizer))
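Note that resize_token_embeddings only allocates freshly initialized rows for the new tokens; the model still has to be fine-tuned again so those embeddings learn something useful. As a quick sanity check that the braces now encode to real ids rather than <unk> (a sketch, reusing the tokenizer from above):

# After add_tokens / resize_token_embeddings, braces map to their own ids
ids = tokenizer('{"writers": 1}').input_ids
print(tokenizer.convert_ids_to_tokens(ids))   # '{' and '}' appear as themselves
print(tokenizer.unk_token_id in ids)          # False for this string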