I am trying to train a BERT model from scratch, following this blog post: https://huggingface.co/blog/how-to-train
I want to train the model on random walks from graph data: the nodes are the words, and going from one node to the next forms the sentence. Because the nodes are just represented by numbers, I don't want any of the words (nodes) broken into subword pieces the way you would for normal language. So I tried creating my own tokenizer: first I built a custom vocab.json file that lists all of the words by frequency in a dictionary (roughly sketched below, after the tokenizer), and then I wrote a custom tokenizer class:
from transformers.tokenization_utils import PreTrainedTokenizer

class RandomWalkTokenizer(PreTrainedTokenizer):
    # the rest of the class is copied from BertTokenizer

    def batch_encode_plus(self, text, **kwargs):
        """
        text: must be a list of lines you want to encode
        """
        tokenized_lines = []
        for line in text:
            tokenized_lines.append(self._tokenize(line))
        return {'input_ids': tokenized_lines}

    def _tokenize(self, text):
        if type(text) == str:
            tokenized_text = []
            # split on spaces only; each node ID is looked up whole, never split into subwords
            for token in text.split(' '):
                tokenized_text.append(self.vocab.get(token, "[UNK]"))
            return tokenized_text
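For context, vocab.json is built roughly like this (a simplified sketch: the choice and ordering of the special tokens here is my assumption, the rest is just node IDs ordered by frequency):

import json
from collections import Counter

# count how often each node ID appears across all walks
counts = Counter()
with open("./walks.txt") as f:
    for line in f:
        counts.update(line.split())

# special tokens first, then node IDs from most to least frequent
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4}
for node, _ in counts.most_common():
    vocab[node] = len(vocab)

with open("./vocab.json", "w") as f:
    json.dump(vocab, f)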
Then I create the dataset with:
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./walks.txt",
    block_size=128,
)
walks.txt is just lines of space-separated numbers like:
10096244043 10079425660 10111609222 10111609222
2462116941 10015483987 2462116941 10012741942
The real lines are much longer but follow the same pattern.
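Tokenizing one of these lines with my tokenizer should give one vocab ID per node, with no subword splitting, e.g. (the IDs below are made up for illustration):

print(tokenizer._tokenize("10096244043 10079425660 10111609222 10111609222"))
# e.g. [17, 523, 88, 88]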
I set up the model with:
from transformers import RobertaConfig
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=114,
    num_attention_heads=12,
    num_hidden_layers=2,
    type_vocab_size=1,
)
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config)
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./outBert",
    overwrite_output_dir=True,
    num_train_epochs=1,
    save_steps=10_000,
    save_total_limit=2,
    # per_gpu_train_batch_size=64,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    prediction_loss_only=True,
)
trainer.train()
But I keep getting:
ValueError: You have to specify either input_ids or inputs_embeds
I think my fundamental issue is that I'm not sure whether I'm creating the tokenizer correctly, and I'm not sure how the masks are supposed to be put into the training dataset when it gets created. Any tips on how to create a simple tokenizer that doesn't break any words into subpieces would be much appreciated.
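For reference, the masking in the blog post is handled by a data collator passed to the Trainer, roughly like this; I haven't wired one up above, and I'm not sure whether my tokenizer would even work with it:

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

It then gets passed to the Trainer as data_collator=data_collator.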