Following the question in https://github.com/huggingface/tokenizers/issues/244, I'm trying to use a WordLevel tokenizer with a RoBERTa transformers model. My vocabulary contains numbers as strings plus special tokens. I have an issue and can localize what is wrong, but I don't know how to fix it. The situation is the following:
tokenizer = RobertaTokenizerFast.from_pretrained("wordlevel", max_len=num_secs_max)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./optiver.txt",
    block_size=128,
)
I see that LineByLineTextDataset splits the numbers into separate digits, which is not what I want. This comes from the tokenizer.batch_encode_plus call it makes internally. I found advice to pass is_split_into_words=True when constructing the RobertaTokenizerFast, but that didn't help. Please explain how to split my corpus by words rather than by individual characters.
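To make the expectation concrete, this is the behavior I am after, shown with the tokenizers library alone (a minimal sketch; the tiny in-memory vocabulary below is just for illustration):

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel

# Illustrative vocabulary: every whole number is a single entry.
toy_vocab = {"<unk>": 0, "1234": 1, "567": 2}
toy = Tokenizer(WordLevel(toy_vocab, unk_token="<unk>"))
toy.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

print(toy.encode("1234 567").tokens)  # expected: ['1234', '567'] - one token per number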
Below are more details about the code I use:
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.processors import BertProcessing
from tokenizers.implementations import BaseTokenizer


class WordLevelBertTokenizer(BaseTokenizer):
    """WordLevelBertTokenizer

    Represents a simple word-level tokenization for BERT.
    """

    def __init__(self, vocab_file: str):
        tokenizer = Tokenizer(WordLevel.from_file(vocab_file))
        # Split on whitespace only, so whole numbers stay single tokens.
        tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

        if vocab_file is not None:
            sep_token_id = tokenizer.token_to_id("</s>")
            if sep_token_id is None:
                raise TypeError("sep_token not found in the vocabulary")
            cls_token_id = tokenizer.token_to_id("<s>")
            if cls_token_id is None:
                raise TypeError("cls_token not found in the vocabulary")
            # Wrap every encoded sequence as <s> ... </s>.
            tokenizer.post_processor = BertProcessing(
                ("</s>", sep_token_id), ("<s>", cls_token_id)
            )

        parameters = {
            "model": "WordLevel",
            "sep_token": "</s>",
            "cls_token": "<s>",
            "pad_token": "<pad>",
            "mask_token": "<mask>",
        }
        super().__init__(tokenizer, parameters)
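# Sanity check that the problem is not in the class above: encoding a few
# whitespace-separated numbers from the vocab directly through it should keep
# each number as one token (the sample string is illustrative).
wl_check = WordLevelBertTokenizer("./wordlevel/vocab.json")
enc = wl_check.encode("17 42 230")
print(enc.tokens)  # expected something like: ['<s>', '17', '42', '230', '</s>']
print(enc.ids)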
from transformers import RobertaConfig
tokenizer = WordLevelBertTokenizer("./wordlevel/vocab.json")
config = RobertaConfig(
    vocab_size=tokenizer.get_vocab_size(),
    max_position_embeddings=tokenizer.get_vocab_size(),
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained("wordlevel", max_len=num_secs_max, add_prefix_space=True)
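# To see what from_pretrained actually built from the "wordlevel" directory,
# I print the backend it wraps (assuming a transformers version that exposes
# backend_tokenizer); this is where I expect the per-character splitting to come from.
print(type(tokenizer.backend_tokenizer.model))
print(tokenizer.backend_tokenizer.pre_tokenizer)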
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)
print(f'Num of model parameters = {model.num_parameters()}')
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./optiver.txt",
    block_size=128,
)
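To check what actually ends up in the dataset, I convert the ids of the first example back to tokens (a quick check; depending on the transformers version the item is a dict or a bare tensor):

first = dataset[0]
ids = first["input_ids"] if isinstance(first, dict) else first
print(tokenizer.convert_ids_to_tokens(ids.tolist()))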
Here is also a simple test of batch_encode_plus:
test = [['1234']]
tokenizer.batch_encode_plus(test, is_split_into_words=True)
output:
{'input_ids': [[1224, 2, 3, 4, 5, 1225]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
This output is not what I want: the tokenizer splits the number into separate digits.
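Is the right direction here to bypass RobertaTokenizerFast entirely and wrap the word-level tokenizer in PreTrainedTokenizerFast before handing it to LineByLineTextDataset? A rough sketch of what I mean (assuming a transformers version that accepts tokenizer_object):

from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.processors import BertProcessing

wl = Tokenizer(WordLevel.from_file("./wordlevel/vocab.json"))
wl.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
wl.post_processor = BertProcessing(
    ("</s>", wl.token_to_id("</s>")), ("<s>", wl.token_to_id("<s>"))
)

fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=wl,
    bos_token="<s>", eos_token="</s>", cls_token="<s>", sep_token="</s>",
    pad_token="<pad>", mask_token="<mask>",
)

Or is there a way to make RobertaTokenizerFast itself use a WordLevel model?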
P.S. Here is a fragment of my vocab file:
{"0":1, "1":2, "2":3, "3":4, "4":5, "5":6, "6":7, "7":8, .... "1220":1221, "1221":1222, "":1223, "":1224, "":1225, "":1226}