Following the question in https://github.com/huggingface/tokenizers/issues/244, I'm trying to use a WordLevel tokenizer with a RoBERTa transformers model. My vocabulary contains numbers as strings plus special tokens. I have an issue and can localize what is wrong, but I don't know how to fix it. The situation is the following:
tokenizer = RobertaTokenizerFast.from_pretrained("wordlevel", max_len=num_secs_max)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./optiver.txt",
    block_size=128,
)
I see that LineByLineTextDataset splits the numbers into separate digits, which is not what I want. This comes from the tokenizer.batch_encode_plus call it makes internally. I found advice to pass is_split_into_words=True when constructing the RobertaTokenizerFast, but that didn't help. Please explain how to split my corpus by words rather than by individual characters.
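To make the expectation concrete, this is the behavior I am after, shown with the tokenizers library alone (a minimal sketch; the tiny in-memory vocabulary below is just for illustration):

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel

# Illustrative vocabulary: every whole number is a single entry.
toy_vocab = {"<unk>": 0, "1234": 1, "567": 2}
toy = Tokenizer(WordLevel(toy_vocab, unk_token="<unk>"))
toy.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

print(toy.encode("1234 567").tokens)  # expected: ['1234', '567'] - one token per number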
Below are more details about the code I use:
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.processors import BertProcessing
from tokenizers.implementations import BaseTokenizer


class WordLevelBertTokenizer(BaseTokenizer):
    """WordLevelBertTokenizer

    Represents a simple word-level tokenization for BERT.
    """

    def __init__(self, vocab_file: str):
        tokenizer = Tokenizer(WordLevel.from_file(vocab_file))
        # Split on whitespace only, so whole numbers stay single tokens.
        tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

        if vocab_file is not None:
            sep_token_id = tokenizer.token_to_id("</s>")
            if sep_token_id is None:
                raise TypeError("sep_token not found in the vocabulary")
            cls_token_id = tokenizer.token_to_id("<s>")
            if cls_token_id is None:
                raise TypeError("cls_token not found in the vocabulary")
            # Wrap every encoded sequence as <s> ... </s>.
            tokenizer.post_processor = BertProcessing(
                ("</s>", sep_token_id), ("<s>", cls_token_id)
            )

        parameters = {
            "model": "WordLevel",
            "sep_token": "</s>",
            "cls_token": "<s>",
            "pad_token": "<pad>",
            "mask_token": "<mask>",
        }
        super().__init__(tokenizer, parameters)
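# Sanity check that the problem is not in the class above: encoding a few
# whitespace-separated numbers from the vocab directly through it should keep
# each number as one token (the sample string is illustrative).
wl_check = WordLevelBertTokenizer("./wordlevel/vocab.json")
enc = wl_check.encode("17 42 230")
print(enc.tokens)  # expected something like: ['<s>', '17', '42', '230', '</s>']
print(enc.ids)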
from transformers import RobertaConfig
tokenizer = WordLevelBertTokenizer("./wordlevel/vocab.json")
config = RobertaConfig(
    vocab_size=tokenizer.get_vocab_size(),
    max_position_embeddings=tokenizer.get_vocab_size(),
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("wordlevel", max_len=num_secs_max, add_prefix_space=True)
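# To see what from_pretrained actually built from the "wordlevel" directory,
# I print the backend it wraps (assuming a transformers version that exposes
# backend_tokenizer); this is where I expect the per-character splitting to come from.
print(type(tokenizer.backend_tokenizer.model))
print(tokenizer.backend_tokenizer.pre_tokenizer)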
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)
print(f'Num of model parameters = {model.num_parameters()}')
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./optiver.txt",
    block_size=128,
)
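To check what actually ends up in the dataset, I convert the ids of the first example back to tokens (a quick check; depending on the transformers version the item is a dict or a bare tensor):

first = dataset[0]
ids = first["input_ids"] if isinstance(first, dict) else first
print(tokenizer.convert_ids_to_tokens(ids.tolist()))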
Here is also a simple test of batch_encode_plus:
test = [['1234']]
tokenizer.batch_encode_plus(test, is_split_into_words=True)
output:
{'input_ids': [[1224, 2, 3, 4, 5, 1225]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
This output is not what I want: the tokenizer splits the number into separate digits.
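Is the right direction here to bypass RobertaTokenizerFast entirely and wrap the word-level tokenizer in PreTrainedTokenizerFast before handing it to LineByLineTextDataset? A rough sketch of what I mean (assuming a transformers version that accepts tokenizer_object):

from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.processors import BertProcessing

wl = Tokenizer(WordLevel.from_file("./wordlevel/vocab.json"))
wl.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
wl.post_processor = BertProcessing(
    ("</s>", wl.token_to_id("</s>")), ("<s>", wl.token_to_id("<s>"))
)

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=wl,
    bos_token="<s>", eos_token="</s>", cls_token="<s>", sep_token="</s>",
    pad_token="<pad>", mask_token="<mask>",
)

Or is there a way to make RobertaTokenizerFast itself use a WordLevel model?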
P.S. Here is a fragment of my vocab file:
{"0":1, "1":2, "2":3, "3":4, "4":5, "5":6, "6":7, "7":8, .... "1220":1221, "1221":1222, "":1223, "":1224, "":1225, "":1226}