
I'm trying to convert a JSON data file into JSONL for GPT-3 fine-tuning. Specifically, I'm trying to work out how to merge the contents of consecutive messages that share the same author ID.

Messages from user 123 form the prompt, and messages from user 456 form the completion.

The data:

{
  "messages": [
    {
      "content": "hello",
      "author": {
        "id": "123"
      }
    },
    {
      "content": "how are you",
      "author": {
        "id": "123"
      }
    },
    {
      "content": "hey! I'm good",
      "author": {
        "id": "456"
      }
    },
    {
      "content": "That's nice!",
      "author": {
        "id": "123"
      }
    },
    {
      "content": "I'm glad to hear you're doing well.",
      "author": {
        "id": "123"
      }
    },
    {
      "content": "Thank you. what about you?",
      "author": {
        "id": "456"
      }
    }
  ]
}

Output should look like this:

{"prompt": "hello how are you", "completion": "hey! I'm good"}
{"prompt": "That's nice! I'm glad to hear you're doing well.", "completion": "Thank you. what about you?"}

I'm super close to figuring this out. I'll be updating the question if I ever find an answer. Thank you for your time!

My attempt so far.

myjson = {
    "messages": [
        {
            "content": "hello",
            "author": {
                "id": "123"
            }
        },
        {
            "content": "how are you",
            "author": {
                "id": "123"
            }
        },
        {
            "content": "hey! I'm good",
            "author": {
                "id": "456"
            }
        },
        {
            "content": "That's nice!",
            "author": {
                "id": "123"
            }
        },
        {
            "content": "I'm glad to hear you're doing well.",
            "author": {
                "id": "123"
            }
        },
        {
            "content": "Thank you. what about you?",
            "author": {
                "id": "123"
            }
        }
    ]
}
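import json

# Flatten the input into (author_id, content) pairs.
messages = [(m["author"]["id"], m["content"]) for m in myjson["messages"]]

pairs = []
j = 0
# Loop over every message; the original bound of len(messages)-1 skipped the last one.
while j < len(messages):
    # Collect the run of consecutive messages from the current author as the prompt.
    prompt_parts = []
    author = messages[j][0]
    while j < len(messages) and messages[j][0] == author:
        prompt_parts.append(messages[j][1])
        j += 1
    # Collect the run from the next author, if there is one, as the completion.
    completion_parts = []
    if j < len(messages):
        author = messages[j][0]
        while j < len(messages) and messages[j][0] == author:
            completion_parts.append(messages[j][1])
            j += 1
    # Join with spaces so the text matches the desired output format.
    pairs.append({"prompt": " ".join(prompt_parts), "completion": " ".join(completion_parts)})

# Print one JSON object per line (JSONL).
for pair in pairs:
    print(json.dumps(pair))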
Dispay
  • Looks like you forgot to include the attempt you were "super close" on – Jab Nov 03 '21 at 02:46
  • My bad- Here's what I've got so far. – Dispay Nov 03 '21 at 02:47
  • https://pastebin.com/6PRbFZZG – Dispay Nov 03 '21 at 02:48
  • Please [edit](https://stackoverflow.com/posts/69819089/edit) your question to include the code you've tried, rather than posting a link in a comment. This helps us know what you're talking about without added steps. If you want help, don't make it hard for people to help you. – Jab Nov 03 '21 at 02:53
  • I am extremely sorry- First time posting. I'll keep that in mind for my future questions. Thank you so much for your time! – Dispay Nov 03 '21 at 02:57
  • No worries, sorry if I came off strong, it happens a lot. – Jab Nov 03 '21 at 03:02
  • I think in `myjson` the last `dict`'s `id` is supposed to be `456`, as your result doesn't match up... – Jab Nov 03 '21 at 03:10
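
The merging described in the question (consecutive messages from the same author joined together, with author 123 as the prompt and author 456 as the completion) could be sketched with `itertools.groupby`. This is only one possible approach, not a confirmed solution from the thread. The names `PROMPT_AUTHOR`, `COMPLETION_AUTHOR`, and `to_pairs` are hypothetical, and the sample data assumes the last message's `id` is `456`, as in the question's data block.

```python
import json
from itertools import groupby

PROMPT_AUTHOR = "123"      # assumption: this author's messages form the prompt
COMPLETION_AUTHOR = "456"  # assumption: this author's messages form the completion


def to_pairs(data):
    """Merge consecutive messages per author, then pair prompt runs with completion runs."""
    # groupby yields one group per run of consecutive messages with the same author ID.
    runs = [
        (author_id, " ".join(m["content"] for m in group))
        for author_id, group in groupby(data["messages"], key=lambda m: m["author"]["id"])
    ]
    pairs = []
    # Keep only a prompt run that is immediately followed by a completion run.
    for (first_id, first_text), (second_id, second_text) in zip(runs, runs[1:]):
        if first_id == PROMPT_AUTHOR and second_id == COMPLETION_AUTHOR:
            pairs.append({"prompt": first_text, "completion": second_text})
    return pairs


sample = {
    "messages": [
        {"content": "hello", "author": {"id": "123"}},
        {"content": "how are you", "author": {"id": "123"}},
        {"content": "hey! I'm good", "author": {"id": "456"}},
        {"content": "That's nice!", "author": {"id": "123"}},
        {"content": "I'm glad to hear you're doing well.", "author": {"id": "123"}},
        {"content": "Thank you. what about you?", "author": {"id": "456"}},
    ]
}

pairs = to_pairs(sample)
# Emit one JSON object per line, which is the JSONL layout.
for pair in pairs:
    print(json.dumps(pair))
```

A trailing prompt run with no reply after it is dropped in this sketch, since it has no completion to pair with. To write a file instead of printing, the same loop can write `json.dumps(pair) + "\n"` to an open file handle.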
