I have been playing around with tensorflow (CPU) and some language modelling - and it has been a blast so far - everything working great.
But after watching my old CPU slowly getting killed by all the model training, I decided it was time to finally get some use out of my RTX 2080. I have been following the guide from Washington University. Pretty quickly I got tensorflow-gpu running and ran it on some light grade prediction and the like.
But when I got to running the GPT-2 language model, I ran into some problems. I start by tokenizing the data:
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer


class BPE_token(object):
    def __init__(self):
        self.tokenizer = Tokenizer(BPE())
        self.tokenizer.normalizer = Sequence([NFKC()])
        self.tokenizer.pre_tokenizer = ByteLevel()
        self.tokenizer.decoder = ByteLevelDecoder()

    def bpe_train(self, paths):
        trainer = BpeTrainer(
            vocab_size=50000,
            show_progress=True,
            initial_alphabet=ByteLevel.alphabet(),
            special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
        )
        self.tokenizer.train(trainer, paths)

    def save_tokenizer(self, location, prefix=None):
        if not os.path.exists(location):
            os.makedirs(location)
        self.tokenizer.model.save(location, prefix)
# ////////// TOKENIZE DATA ////////////
import os
from pathlib import Path

# the folder 'da_corpus' contains all the text files
paths = [str(x) for x in Path("./da_corpus/").glob("**/*.txt")]

# train the tokenizer model
tokenizer = BPE_token()
tokenizer.bpe_train(paths)

# save the trained tokenizer to our specified folder
save_path = 'tokenized_data'
tokenizer.save_tokenizer(save_path)
The code above works perfectly and tokenizes the data - just like with tensorflow (CPU). After tokenizing my data I start to train my model - but before it even gets started, I get the following ImportError:
from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2Tokenizer # loading tokenizer from the saved model path
ImportError: cannot import name 'TFGPT2LMHeadModel' from 'transformers' (unknown location)
The transformers package seems to be installed correctly in the site-packages lib, and I seem to be able to import the other transformers classes - but not TFGPT2LMHeadModel. I have read everything on Google and huggingface.co - tried different versions of tensorflow-gpu, transformers, tokenizers and a lot of other packages - sadly nothing helps.
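Since the other classes import fine, one thing I tried was asking transformers itself whether it sees a usable TensorFlow install - as far as I can tell from the 4.x package, the TF* model classes are only defined when `transformers.is_tf_available()` returns True. The helper below (the name `tf_models_available` is my own, not from any guide) is just a diagnostic sketch:

```python
import importlib.util


def tf_models_available() -> bool:
    """Best-effort check for whether transformers should expose its
    TF* classes (e.g. TFGPT2LMHeadModel).

    Relies on transformers' top-level is_tf_available() helper,
    which exists in the 4.x releases.
    """
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    import transformers
    # False here usually means transformers could not import a
    # TensorFlow version it supports, so the TF* names never get defined.
    return bool(transformers.is_tf_available())


print(tf_models_available())
```

If this prints False even though tensorflow-gpu imports on its own, that would suggest transformers is rejecting the installed TensorFlow rather than the package itself being broken.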
Packages:
- Python 3.7.1
- Tensorflow 2.1.0
- Tensorflow-gpu 2.1.0
- Tensorflow-base 2.1.0
- Tensorflow-estimator 2.1.0
- Transformers 4.2.2
- Tokenizers 0.9.4
- cudnn 7.6.5
- cudatoolkit 10.1.243