How to fine-tune gpt2 with a custom set of unlabelled document

Question

I'm newbie to GPT2 fine-tuning. My goal is to fine-tune GPT-2 (or BERT) on a my own set of document, in order to be able to query the bot on a topic contained in these documents, and receive an answer. I have some doubts on how to develop this, because I saw that fine tuning a Question and Answer chatbot requires a labelled dataset, containing questions relatet to a answer.

Is it possible to fine tune a language model on an unlabelled dataset? After I train the model on my data, can I already query it or anyway is there a need to fine-tune on a specific task using an annotated dataset? Is there a minumum number of documents on order to achieve good results? Is it possible to do on a non-english language? Thank you.

score 0 · Answer 1 · answered Mar 27 '23 at 14:29

You can fine tune GPT2 with an unlabeled dataset, but I can assure you the results won't be what you're looking for. You can ask it a question, but because that question doesn't really appear in your data, it might not answer in a very legible way. What you could do is prompt it the first sentence of what you think the answer is, and let GPT2 fill in the rest, or you could try a few-shot prompt, where you give it multiple examples of question and answer pairs, and then finish the prompt with the question you want answered, but mileage may vary. Once you train the model with the unlabeled data you can query it, but if you want to prepare a proper dataset, I recommend training just on that. I would recommend at least a couple hundred good examples (rows), and working your way up from there. And to the best of my knowledge, GPT2 was trained on English only, but there are multilingual LLMs out there.

How to fine-tune gpt2 with a custom set of unlabelled document

1 Answers1