I was looking for a way to train GPT-2 on my own text data and found this Kaggle notebook: https://www.kaggle.com/code/ashiqabdulkhader/train-gpt-2-on-custom-language
Everything runs without errors, including building the dataset and the model, but the generated text is very strange. This is the code I use to generate it:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
output_dir = "kaggle/working/gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = TFGPT2LMHeadModel.from_pretrained(output_dir, pad_token_id=tokenizer.eos_token_id)
text = "what is python?"
input_ids = tokenizer.encode(text, return_tensors='tf')
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    temperature=0.7,
    no_repeat_ngram_size=2,
    num_return_sequences=5
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
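The call above asks for five sequences but prints only the first. To compare all five candidates, a small helper along these lines could be used. `decode_all` is a hypothetical name of mine, not part of transformers, and it only assumes the tokenizer object has a `decode` method that accepts `skip_special_tokens`.

```python
from typing import Any, Iterable, List


def decode_all(tokenizer: Any, sequences: Iterable[Any]) -> List[str]:
    """Decode every returned sequence instead of only the first one.

    `tokenizer` is assumed to expose decode(seq, skip_special_tokens=...),
    as GPT2Tokenizer does; `sequences` is any iterable of token-id sequences.
    """
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in sequences]


# Intended usage with the objects from the snippet above (not run here):
# for i, text in enumerate(decode_all(tokenizer, beam_output)):
#     print(f"{i}: {text}")
```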
My corpus data is:
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.[33]
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.[34][35]
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0.[36] Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.[37]
Python consistently ranks as one of the most popular programming languages.[38][39][40][41]
(The text is taken from Wikipedia.)
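To put the size of this corpus in perspective, a rough count can be made as sketched below. The plain regex word split is an assumption for illustration and is not how GPT-2's byte-level BPE tokenizer splits text. `corpus_stats` is a hypothetical helper name.

```python
import re
from typing import Dict


def corpus_stats(text: str) -> Dict[str, int]:
    """Return rough size figures for a training corpus.

    Words are runs of letters, digits, or underscores, lowercased.
    This is only an approximation, not GPT-2 tokenization.
    """
    words = re.findall(r"\w+", text.lower())
    return {
        "words": len(words),
        "unique_words": len(set(words)),
        "characters": len(text),
    }


# Example on one sentence of the corpus; pass the full text to measure it all.
sample = "Python is dynamically typed and garbage-collected."
print(corpus_stats(sample))
```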
The output is:
what is python 202gragra 202 202 language 202] 202astast 202abilityability 2027ely 202uralural 202ralral Rossum Rossum 202use 202 as Rossumability code Rossum] Rossumifi Rossumleas Rossumast Rossum80
When loading the model, it also prints this warning:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
I am completely new to NLP. What is the issue here?