OpenAI GPT-3 API: Which file formats can be used for fine-tuning?

Question

We are entering turbulent times for AI, and I am adding my drop to the ocean. I work in Python, so all my attempts are done in Python/Anaconda.

Does anybody already have experience with the data formats that can be passed to the GPT family of models?

The documentation recommends using the OpenAI tool to check the data, and it recommends a ("Prompt:", "Completion:") format, with strings marked as follows:

  ["str" = in quotes,"/" = separator ,"@>" = unique symbol, 
   " " = comp. starts with empty space]

  'Prompt':    'Hello AI..!!/@>' 
  'Completion': ' How are you today?/@>'

Each "Completion" should start with an empty space. So far I have only been able to find simple examples such as:

Col1             Col2
'Prompt':        'Completion':
'Text/@>'        ' Text/@>'

Is there any way for it to understand a more complex dataset? Is it effective to use a DataFrame with more dimensions? For example:

     Col1        Col2             Col3         Col4        
    'Prompt_a':  'Completion_a':  'Prompt_b':  'Completion_b':
    'Text/@>'    ' Text/@>'       'Text/@>'    ' Text/@>'

Is a longer context passed simply as 'str/@>', or does it need to be partitioned?

' text text text /@>'

Many thanks for all answers and efforts in advance.

Already checked: https://help.openai.com/en/articles/6811186-how-do-i-format-my-fine-tuning-data

score 1 · Answer 1 · answered Feb 24 '23 at 19:10

As stated in the official OpenAI documentation:

Your data must be a JSONL document, where each line is a prompt-completion pair corresponding to a training example. You can use our CLI data preparation tool to easily convert your data into this file format.

This tool accepts different formats, with the only requirement that they contain a prompt and a completion column/key. You can pass a CSV, TSV, XLSX, JSON or JSONL file, and it will save the output into a JSONL file ready for fine-tuning, after guiding you through the process of suggested changes.
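
As a rough illustration of the JSONL layout described above, here is a minimal sketch using only the Python standard library. The helper name `to_jsonl` and the two marker strings are hypothetical placeholders, not something prescribed by the documentation; writing one prompt with several completions as several lines is likewise an assumption, not something confirmed here.

```python
import json

# Hypothetical markers: any strings that never occur in the real text would do.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = " END"


def to_jsonl(pairs):
    """Return JSONL text with one {"prompt", "completion"} object per line."""
    lines = []
    for prompt, completion in pairs:
        record = {
            "prompt": prompt + PROMPT_END,
            # Each completion starts with a single space, as the question notes.
            "completion": " " + completion.strip() + COMPLETION_END,
        }
        # json.dumps escapes newlines, so every record stays on one line.
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


# Assumption: one prompt with N completions is written as N separate lines.
pairs = [
    ("Hello AI..!!", "How are you today?"),
    ("Hello AI..!!", "Nice to meet you."),
]
jsonl_text = to_jsonl(pairs)
print(jsonl_text)
```

Since each line holds exactly one prompt-completion pair, a wider table (for example with `Prompt_a`/`Completion_a` and `Prompt_b`/`Completion_b` columns) would presumably have to be flattened into such pairs first.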

answered Feb 24 '23 at 19:10

Rok Benko

Hi Cervus, thanks for your message. I was looking into that, and what I am still missing (I have not been able to find it so far) is, for example: 1) whether the keys/columns can hold just one prompt/completion pair or whether more are accepted; 2) whether, if I have several completions for one question, it is possible to send 1 prompt vs. N completions (it should be); 3) whether I can have several classifiers with different purposes, for example: fruits = 'annas /#n', books = 'map atlas /#%', where /#n and /#% are exclusive. – Jan M. Feb 25 '23 at 14:23

Hi, Jan. I will make a test and edit my answer tomorrow, so check out my answer tomorrow. :) I want to be sure that I'm not guessing, but rather testing myself before I answer you. – Rok Benko Feb 26 '23 at 10:57
Thanks Cervus. I was looking into the OpenAI docs, and they do have some guidance on that, as follows: {"prompt":"\n\n###\n\n", "completion":" END"}. I was also able to find multiple other possible formats. Many thanks for testing. I will try to run something myself (I am just starting, so I expect some errors :)). – Jan M. Feb 26 '23 at 14:05
Just for the record: https://gpt-index.readthedocs.io/en/latest/getting_started/starter_example.html ... This looks like it can do the job, but it does not say exactly how :) – Jan M. Feb 26 '23 at 14:41
Rok, sorry for misreading your comment; I thought it was from Cervus. – Jan M. Feb 27 '23 at 09:23
So, at least something from my side. I have formatted my prompt/completion as follows: '{"prompt":"$1_text/+###>"},{"completion":" $2_text/+--->END"}' and I am still getting this error: _ERROR in necessary column validator: `prompt` column/key is missing. Please make sure you name your columns/keys appropriately, then retry_ even though two columns with those exact names are present. – Jan M. Mar 04 '23 at 12:44
Hi, Jan! I didn't forget about you. I'm just very busy and didn't have time to find a solution. Did you figure out anything in the meantime? – Rok Benko Mar 12 '23 at 22:13
Hi Rok, I tried to go through the examples and build the prompts accordingly. It really does not matter what I send to the CLI tool; I still get the same error. To be honest, the documentation is not really clear on that. To some extent it looks deprecated. I am about to try out that library (gpt-index / llama), and I am reading through the OpenAI forum, but it looks new as well. Any help is appreciated! – Jan M. Mar 13 '23 at 11:56
I will try one more option with an added 'str' for classification, in case that helps. Something like: _{"Prompt":"classification_str : text_str + prompt_end_mark", "completion" : "text_str + end_mark"}_ It is not clear, though, why the classification would be necessary. – Jan M. Mar 13 '23 at 12:23
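
One possible reading of the `prompt` column/key error quoted in the comments, which is not confirmed anywhere in this thread: the string from Mar 04 pairs two separate JSON objects on one line, while a JSONL line is expected to be a single object carrying both keys. The sketch below only illustrates that difference with the standard `json` module; `check_line` is a hypothetical helper and does not reproduce what the OpenAI CLI tool actually checks.

```python
import json


def check_line(line):
    """Return a list of problems found in one candidate JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not a single JSON value: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    # Hypothetical strict check: both keys must be present in the same object.
    return [
        f"missing key: {key}"
        for key in ("prompt", "completion")
        if key not in record
    ]


# The layout quoted in the comment: two objects separated by a comma.
split_objects = '{"prompt":"$1_text/+###>"},{"completion":" $2_text/+--->END"}'
# The same text as one object with both keys.
one_object = '{"prompt":"$1_text/+###>", "completion":" $2_text/+--->END"}'

print(check_line(split_objects))
print(check_line(one_object))
```

If this reading is right, merging the two objects into one per line would be the first thing to try before passing the file to the CLI tool.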
