
The fine-tuning endpoint for OpenAI's API seems to be fairly new, and I can't find many examples of fine-tuning datasets online.

I'm in charge of a voicebot, and I'm testing out the performance of GPT-3 for general open-conversation questions. I'd like to train the model on the "fixed" intent-response pairs we're currently using: this would probably end up performing better in terms of company voice and style.

I have a long JSON file ready, extracted from our current conversational engine, which matches user input to intents and returns the specified response. I'd like to train a GPT-3 model on this data.

As of now, for some quick testing, I've set up my calls to the API just like they suggest. I have a "fixed" intro text in the form

<name> is <company>'s voicebot. he is kind and professional...

This is a conversation between <name> and a customer:

which is prepended to each query. A small Python class then keeps track of the context, which starts with

User: <request the user provides>
Bot:

Then, with each turn, the API's response is appended; this way I'm keeping track of what has been said. After a few questions, the query or prompt string I'm sending looks like this:

<name> is <company>'s voicebot. he is kind and professional...

This is a conversation between <name> and a user:

User: <request>
Bot: <response>
User: <request>
Bot: <response>
... and so on
Bot:
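
For reference, a minimal sketch of such a class (not my exact code; the model name and parameters are placeholders, and it assumes the openai Python package's Completion.create call) would be:

import openai

INTRO = (
    "<name> is <company>'s voicebot. he is kind and professional...\n\n"
    "This is a conversation between <name> and a user:\n\n"
)

class Conversation:
    """Builds the full prompt from the intro plus the running user/bot exchange."""

    def __init__(self):
        self.turns = []  # alternating "User: ..." / "Bot: ..." lines

    def ask(self, user_text):
        self.turns.append(f"User: {user_text}")
        prompt = INTRO + "\n".join(self.turns) + "\nBot:"
        response = openai.Completion.create(
            engine="davinci",   # placeholder model name
            prompt=prompt,
            max_tokens=150,
            temperature=0.7,
            stop=["User:"],     # stop before the model writes the next user turn
        )
        answer = response["choices"][0]["text"].strip()
        self.turns.append(f"Bot: {answer}")
        return answer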

My question is: do I have to use the same "format" for my training data? Is it advisable? The docs indicate that the training set should be in this format:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...

But does the prompt need to include my intro text (the description) each time, or do I simply provide a series of user/bot exchanges ending with Bot:, with the expected answer as the completion? What would be best practice in this case? My fear is that if I wanted to slightly change the intro prompt a month from now, I'd have to retrain the whole thing, because each response would have been trained with that specific block of text prepended.
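
To make the question concrete, the two options I see would produce training lines roughly like these (the wording is invented):

{"prompt": "<name> is <company>'s voicebot. he is kind and professional...\n\nThis is a conversation between <name> and a user:\n\nUser: What are your opening hours?\nBot:", "completion": " We're open Monday to Friday, 9 to 6."}
{"prompt": "User: What are your opening hours?\nBot:", "completion": " We're open Monday to Friday, 9 to 6."}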

fcagnola

2 Answers


I contacted OpenAI's support and they were extremely helpful: I'll leave their answer here.

the prompt does not need the fixed intro every time. Instead, you'll just want to provide at least a few hundred prompt-completion pairs of user/bot exchanges. We have a sample of a chatbot fine-tuning dataset here.
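
For anyone converting a similar intent/response export, something along these lines should produce the JSONL they describe (the field names are placeholders for whatever your engine exports):

import json

# Hypothetical export format: a list of {"utterances": [...], "response": "..."} intents.
with open("intents.json") as f:
    intents = json.load(f)

with open("training_data.jsonl", "w") as out:
    for intent in intents:
        for utterance in intent["utterances"]:
            pair = {
                "prompt": f"User: {utterance}\nBot:",
                "completion": f" {intent['response']}\n",
            }
            out.write(json.dumps(pair) + "\n")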

fcagnola

It's true that you do not need the intro if you have enough examples.

However, if you are working with only a small number of examples (fewer than 100, for example), the additional context from semantic labels can be helpful.

See OpenAI's best practices for fine-tuning GPT-3 document, under "How to pick labels":

In general, fine-tuning can work with any label, whether the label has semantic meaning (e.g., “ edible”) or not (e.g., “1”). That said, in cases with little training data per label, it’s possible that semantic labels work better, so that the model can leverage its knowledge of the label’s meaning.

You might also want to consider prepending other information in the intro section that varies with each request, such as account data about the user that the bot can reference.
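
For example, a prompt assembled at request time could look something like this (the account fields are invented for illustration):

Account: premium plan, renews on 2023-05-01
Preferred name: Jane

User: When does my subscription renew?
Bot: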

Formatting the data in this way and generating a JSONL with prompt/completion pairs whenever you want to make changes can be a pain, so I recommend a tool like Entry Point AI to create templates for the prompt/completion pairs that use defined fields, manage your training data, and test the results from your fine-tunes.

Mark H