I'm trying to convert a JSON data file into JSONL for GPT-3 fine-tuning. Specifically, I want to merge the contents of consecutive messages from the same author ID, where user 123 is the prompt and user 456 is the completion.
The data:
{
  "messages": [
    {
      "content": "hello",
      "author": {
        "id": "123"
      }
    },
    {
      "content": "how are you",
      "author": {
        "id": "123"
      }
    },
    {
      "content": "hey! I'm good",
      "author": {
        "id": "456"
      }
    },
    {
      "content": "That's nice!",
      "author": {
        "id": "123"
      }
    },
    {
      "content": "I'm glad to hear you're doing well.",
      "author": {
        "id": "123"
      }
    },
    {
      "content": "Thank you. what about you?",
      "author": {
        "id": "456"
      }
    }
  ]
}
Output should look like this:
{"prompt": "hello how are you", "completion": "hey! I'm good"}
{"prompt": "That's nice! I'm glad to hear you're doing well.", "completion": "Thank you. what about you?"}
I'm super close to figuring this out. I'll be updating the question if I ever find an answer. Thank you for your time!
My attempt so far:
myjson = {
    "messages": [
        {
            "content": "hello",
            "author": {
                "id": "123"
            }
        },
        {
            "content": "how are you",
            "author": {
                "id": "123"
            }
        },
        {
            "content": "hey! I'm good",
            "author": {
                "id": "456"
            }
        },
        {
            "content": "That's nice!",
            "author": {
                "id": "123"
            }
        },
        {
            "content": "I'm glad to hear you're doing well.",
            "author": {
                "id": "123"
            }
        },
        {
            "content": "Thank you. what about you?",
            "author": {
                "id": "456"
            }
        }
    ]
}
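import json

# Flatten the data into (author_id, content) tuples.
messages = []
for msg in myjson["messages"]:
    messages.append((msg["author"]["id"], msg["content"]))

pairs = []
j = 0
# Loop bound is len(messages), not len(messages)-1, so the last message is not dropped.
while j < len(messages):
    prompt_parts = []
    completion_parts = []
    # Collect the run of consecutive messages from the current author (the prompt).
    temp = messages[j][0]
    while j < len(messages) and messages[j][0] == temp:
        prompt_parts.append(messages[j][1])
        j += 1
    # Collect the following run from the next author (the completion), if any is left.
    if j < len(messages):
        temp = messages[j][0]
        while j < len(messages) and messages[j][0] == temp:
            completion_parts.append(messages[j][1])
            j += 1
    # Join with spaces to match the desired output format.
    pairs.append({"prompt": " ".join(prompt_parts), "completion": " ".join(completion_parts)})

# json.dumps prints valid JSON (double quotes), one object per line, i.e. JSONL.
for pair in pairs:
    print(json.dumps(pair))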
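For comparison, here is a minimal sketch of another way to do the merge, using `itertools.groupby` from the standard library to collapse consecutive messages by the same author. The function name `to_prompt_completion` and its parameters `prompt_id` and `completion_id` are made up for illustration. It assumes a record is only wanted when a run from user 123 is immediately followed by a run from user 456; anything else is skipped.

```python
import json
from itertools import groupby


def to_prompt_completion(messages, prompt_id="123", completion_id="456"):
    """Merge consecutive messages per author, then pair prompt runs with completion runs."""
    # Collapse each run of consecutive messages by the same author into one string.
    runs = [
        (author_id, " ".join(m["content"] for m in group))
        for author_id, group in groupby(messages, key=lambda m: m["author"]["id"])
    ]
    pairs = []
    # A prompt run immediately followed by a completion run forms one record.
    for (a_id, a_text), (b_id, b_text) in zip(runs, runs[1:]):
        if a_id == prompt_id and b_id == completion_id:
            pairs.append({"prompt": a_text, "completion": b_text})
    return pairs


# Same sample data as in the question, written compactly.
data = {
    "messages": [
        {"content": "hello", "author": {"id": "123"}},
        {"content": "how are you", "author": {"id": "123"}},
        {"content": "hey! I'm good", "author": {"id": "456"}},
        {"content": "That's nice!", "author": {"id": "123"}},
        {"content": "I'm glad to hear you're doing well.", "author": {"id": "123"}},
        {"content": "Thank you. what about you?", "author": {"id": "456"}},
    ]
}

pairs = to_prompt_completion(data["messages"])

# One JSON object per line is the JSONL format.
for pair in pairs:
    print(json.dumps(pair))
```

To write a file instead of printing, the same loop could write `json.dumps(pair) + "\n"` to an open file handle.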