
I am trying to train a BERT model from scratch using this blog post: https://huggingface.co/blog/how-to-train

I am training the model on random walks from graph data: the nodes are the words, and walking from one node to the next forms a sentence. Because the nodes are just represented by numbers, I don't want any of the words (nodes) broken into sub-word pieces the way they would be for normal language. So I tried building my own tokenizer: I first created a custom vocab.json file that lists all of the words by frequency in a dictionary, and then wrote a custom tokenizer class:

from transformers.tokenization_utils import PreTrainedTokenizer
class RandomWalkTokenizer(PreTrainedTokenizer):

    # the rest of the class is copied from BertTokenizer

    def batch_encode_plus(self, text, **kwargs):
        """
        text: must be a list of the lines you want to encode
        """
        tokenized_lines = []
        for line in text:
            tokenized_lines.append(self._tokenize(line))
        return {'input_ids': tokenized_lines}

    def _tokenize(self, text):
        if isinstance(text, str):
            tokenized_text = []
            # look up each space-separated node in the vocab;
            # unknown nodes fall back to the string "[UNK]"
            for token in text.split(' '):
                tokenized_text.append(self.vocab.get(token, "[UNK]"))
        return tokenized_text
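
For context, vocab.json just maps each node (as a string) to an integer id, ranked by frequency, so once it is loaded self.vocab looks roughly like this (the ids below are illustrative):

import json

# rough shape of vocab.json once loaded: node string -> integer id,
# ordered by how often the node appears in the walks (ids are illustrative)
with open("vocab.json") as f:
    vocab = json.load(f)

# vocab == {"[UNK]": 0, "10111609222": 1, "2462116941": 2, "10096244043": 3, ...}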

Then I create the dataset with:

from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./walks.txt",
    block_size=128,
)
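
As far as I can tell from the transformers source, LineByLineTextDataset just reads the non-empty lines of the file, passes them to the tokenizer's batch_encode_plus, and keeps only the input_ids, roughly like this (a paraphrased sketch, not the exact library code):

import torch

# paraphrased sketch of what I believe LineByLineTextDataset does internally:
# read the non-empty lines, call batch_encode_plus, keep only "input_ids";
# __getitem__ then wraps one encoded line in a LongTensor
with open("./walks.txt", encoding="utf-8") as f:
    lines = [line for line in f.read().splitlines() if line and not line.isspace()]

batch_encoding = tokenizer.batch_encode_plus(lines, add_special_tokens=True, max_length=128)
examples = batch_encoding["input_ids"]
first_item = torch.tensor(examples[0], dtype=torch.long)  # what dataset[0] returns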

Here walks.txt is just lines of space-separated numbers, like:

10096244043 10079425660 10111609222 10111609222
2462116941 10015483987 2462116941 10012741942

The actual lines are much longer (80 nodes per line, over many lines) but follow the same pattern.
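
As a quick sanity check, running one of these lines through the tokenizer directly gives a plain Python list of ids (or the string "[UNK]" for any node missing from the vocab):

# the exact ids depend on vocab.json; the values in the comment are illustrative
line = "10096244043 10079425660 10111609222 10111609222"
print(tokenizer._tokenize(line))  # e.g. [3, 57, 1, 1]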

I then start the model with:

from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=114,
    num_attention_heads=12,
    num_hidden_layers=2,
    type_vocab_size=1
)
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./outBert",
    overwrite_output_dir=True,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
    #per_gpu_train_batch_size=64,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    prediction_loss_only=True,
)

trainer.train()

But I keep getting:

ValueError: You have to specify either input_ids or inputs_embeds

I think my fundamental issue is that I'm not sure whether I'm creating the tokenizer correctly, and I'm not sure how the masks are supposed to get into the training dataset when it is created. Any tips on how to create a simple tokenizer that doesn't break words down into subpieces would be much appreciated.
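
For reference, the blog post I'm following also passes a masking data collator to the Trainer, which I assume is where the masks are supposed to come from:

from transformers import DataCollatorForLanguageModeling

# from the blog post: randomly masks 15% of the tokens for the MLM objective;
# it is passed to the Trainer as data_collator=data_collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)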

jharkins
  • Can you please add some lines of your `dataset` variable? – cronoik Jun 13 '20 at 12:56
  • I'm not sure what you mean; the above code is everything I used to initialize the `dataset` variable. `vocab.json` is a dict of all of the 'words' (nodes) ranked by frequency, `walks.txt` is a file containing the sentences (just space-separated numbers), and `dataset[0]` returns a PyTorch tensor of the first encoded sentence. – jharkins Jun 14 '20 at 19:32
  • Well, we can't reproduce your issue without some lines of walks.txt (or, to be exact, maybe we can, but it costs us time). So please show us some lines of walks.txt. – cronoik Jun 15 '20 at 00:13
  • here are some lines from walks.txt: 10070864075 1637322970 10069101191 10079425660 10086576573 10097452332 10057702413 10097452332 10089843726 10069391141 2462116941 10015483987 10012741942 10012273741 10061455520 – jharkins Jun 15 '20 at 15:00
  • Two items per line? Please add this directly to your question. – cronoik Jun 16 '20 at 12:48
  • There are actually 80 per line, and there are many lines. I added a quick example to the question. – jharkins Jun 16 '20 at 17:20
