I was looking for a way to train GPT-2 on my own text data, and I found this Kaggle notebook: https://www.kaggle.com/code/ashiqabdulkhader/train-gpt-2-on-custom-language

Building the model and the dataset both work fine, but the generated text is very strange.

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

output_dir = "kaggle/working/gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = TFGPT2LMHeadModel.from_pretrained(output_dir, pad_token_id=tokenizer.eos_token_id)

text = "what is python?"
input_ids = tokenizer.encode(text, return_tensors='tf')
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    temperature=0.7,
    no_repeat_ngram_size=2,
    num_return_sequences=5
)

print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
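For context on one of the generation arguments above: `no_repeat_ngram_size=2` asks the generator not to produce any two-token sequence twice. The idea can be sketched in plain Python as below. This is only a simplified illustration of the constraint, not the Transformers implementation, and the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Return token ids that would repeat an n-gram already present in `generated`.

    Simplified illustration of the idea behind `no_repeat_ngram_size`;
    this is NOT the library's code, and the function name is hypothetical.
    """
    if ngram_size < 1 or len(generated) < ngram_size:
        # Too short to contain a complete n-gram, so nothing is banned yet.
        return set()

    # The last (n - 1) tokens are the prefix the next token would extend.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])

    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        # If an earlier n-gram begins with the same prefix, its last token
        # would recreate that n-gram, so it is banned as the next token.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# Example with made-up token ids: the bigram (5, 7) already occurred and the
# sequence currently ends in 5, so 7 may not come next.
print(banned_next_tokens([5, 7, 5], ngram_size=2))
```

With a bigram constraint, a token that has already followed the current last token cannot follow it again, so the generator has to pick a different continuation each time the same token recurs.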

My corpus data is:

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.[33]

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.[34][35]

Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0.[36] Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.[37]

Python consistently ranks as one of the most popular programming languages.[38][39][40][41]

This text is taken from Wikipedia.

The output:

what is python 202gragra 202 202 language 202] 202astast 202abilityability 2027ely 202uralural 202ralral Rossum Rossum 202use 202 as Rossumability code Rossum] Rossumifi Rossumleas Rossumast Rossum80

It also says:

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

I am completely new to NLP. What is the issue here?
