0

I've read that there are several ways to work around the token limit for inputs: Stuff, Map Reduce, Refine, and Map Rerank. In my case, however, I want to produce a large JSON document as output. The problem with JSON is that GPT models other than Codex tokenize whitespace inefficiently, so indentation inflates the token count. For instance, this JSON file

[
    {
        "id": 1,
        "category": "Player effects",
        "details": [
            {
                "effect": "Give weapons",
                "cheat": "Triangle, R2, Left, L1, Cross, Right, Triangle, Down, Square, L1, L1, L1"
            },
            {
                "effect": "Max health + Armor",
                "cheat": "Circle, L1, Triangle, R2, Cross, Square, Circle, Right, Square, L1, L1, L1"
            },
            {
                "effect": "Invincibility",
                "cheat": "Right, Cross, Right, Left, Right, R1, Right, Left, Cross, Triangle"
            },
            {
                "effect": "Lower wanted level",
                "cheat": "R1, R1, Circle, R2, Right, Left, Right, Left, Right, Left"
            },
            {
                "effect": "Raise wanted level",
                "cheat": "R1, R1, Circle, R2, Left, Right, Left, Right, Left, Right"
            },
            {
                "effect": "Special ability recharge",
                "cheat": "Cross, Cross, Square, R1, L1, Cross, Right, Left, Cross"
            },
            {
                "effect": "Bang bang!",
                "cheat": "Right, Square, Cross, Left, R1, R2, Left, Right, Right, L1, L1, L1"
            },
            {
                "effect": "Flaming bullets",
                "cheat": "L1, R1, Square, R1, Left, R2, R1, Left, Square, Right, L1, L1"
            },
            {
                "effect": "Explosive melee attacks",
                "cheat": "Right, Left, Cross, Triangle, R1, Circle, Circle, Circle, L2"
            },
            {
                "effect": "Super jump",
                "cheat": "L2, L2, Square, Circle, Circle, L2, Square, Square, Left, Right, Cross"
            },
            {
                "effect": "Give parachute",
                "cheat": "Left, Right, L1, L2, R1, R2, R2, Left, Left, Right, L1"
            },
            {
                "effect": "Skyfall",
                "cheat": "L1, L2, R1, R2, Left, Right, Left, Right, L1, L2, R1, R2, Left, Right, Left, Right"
            },
            {
                "effect": "Drunk mode",
                "cheat": "Triangle, Right, Left, Right, Square, Circle, Left"
            },
            {
                "effect": "Fast Run",
                "cheat": "Triangle, Left, Right, Right, L2, L1, Square"
            },
            {
                "effect": "Fast swim",
                "cheat": "Left, Left, L1, Right, Right, R2, Left, L2, Right"
            },
            {
                "effect": "Slow motion aiming",
                "cheat": "Square, L2, R1, Triangle, Left, Square, L2, Right, Cross"
            }
        ]
    },
    {
        "id": 2,
        "category": "World effects",
        "details": [
            {
                "effect": "Change weather",
                "cheat": "R2, Cross, L1, L1, L2, L2, L2, Square"
            },
            {
                "effect": "Slidey cars",
                "cheat": "Triangle, R1, R1, Left, R1, L1, R2, L1"
            },
            {
                "effect": "Slow motion",
                "cheat": "Triangle, Left, Right, Right, Square, R2, R1"
            },
            {
                "effect": "Moon gravity",
                "cheat": "Left, Left, L1, R1, L1, Right, Left, L1, Left"
            }
        ]
    },
    {
        "id": 3,
        "category": "Vehicle",
        "details": [
            {
                "effect": "Spawn BMX",
                "cheat": "Left, Left, Right, Right, Left, Right, Square, Circle, Triangle, R1, R2"
            },
            {
                "effect": "Spawn Buzzard",
                "cheat": "Circle, Circle, L1, Circle, Circle, Circle, L1, L2, R1, Triangle, Circle, Triangle"
            },
            {
                "effect": "Spawn Caddy",
                "cheat": "Circle, L1, Left, R1, L2, Cross, R1, L1, Circle, Cross"
            },
            {
                "effect": "Spawn Comet",
                "cheat": "R1, Circle, R2, Right, L1, L2, Cross, Cross, Square, R1"
            },
            {
                "effect": "Spawn Duster",
                "cheat": "Right, Left, R1, R1, R1, Left, Triangle, Triangle, Cross, Circle, L1, L1"
            },
            {
                "effect": "Spawn Limousine",
                "cheat": "R2, Right, L2, Left, Left, R1, L1, Circle, Right"
            },
            {
                "effect": "PCJ-600",
                "cheat": "R1, Right, Left, Right, R2, Left, Right, Square, Right, L2, L1, L1"
            },
            {
                "effect": "Spawn Rapid GT",
                "cheat": "R2, L1, Circle, Right, L1, R1, Right, Left, Circle, R2"
            },
            {
                "effect": "Spawn Sanchez",
                "cheat": "Circle, Cross, L1, Circle, Circle, L1, Circle, R1, R2, L2, L1, L1"
            },
            {
                "effect": "Spawn Stunt Plane",
                "cheat": "Circle, Right, L1, L2, Left, R1, L1, L1, Left, Left, Cross, Triangle"
            },
            {
                "effect": "Spawn Trashmaster",
                "cheat": "Circle, R1, Circle, R1, Left, Left, R1, L1, Circle, Right"
            }
        ]
    },
    {
        "id": 4,
        "category": "Special Vehicles",
        "details": [
            {
                "effect": "Spawn Dodo",
                "cheat": "1-999-398-4628 (EXTINCT)"
            },
            {
                "effect": "Spawn Duke O'Death",
                "cheat": "1-999-3328-4227 (DEATHCAR)"
            },
            {
                "effect": "Spawn Kraken",
                "cheat": "1-999-282-2537 (BUBBLES)"
            }
        ]
    }
]

is 3,432 tokens (5,703 characters) according to the GPT-3 tokenizer and 1,688 tokens (5,703 characters) according to the Codex tokenizer (source: https://platform.openai.com/tokenizer). If this JSON output were five times bigger, what would be the best way to handle it with LangChain using a model like text-davinci-003?
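To see how much of the size is indentation alone, the same data can be serialized with and without whitespace. This is a minimal sketch using only the standard library. It compares character counts, not token counts, and the `sample` record is just one entry copied from the file above.

```python
import json

# One record taken from the JSON above, used purely as sample data.
sample = [
    {
        "id": 4,
        "category": "Special Vehicles",
        "details": [
            {"effect": "Spawn Kraken", "cheat": "1-999-282-2537 (BUBBLES)"}
        ],
    }
]

pretty = json.dumps(sample, indent=4)                # indented, as in the question
compact = json.dumps(sample, separators=(",", ":"))  # no spaces or newlines

# Both strings decode to the same data; only the whitespace differs.
assert json.loads(pretty) == json.loads(compact)
print(len(pretty), "characters indented,", len(compact), "characters compact")
```

Asking the model for compact output and re-indenting it afterwards in code keeps the whitespace out of the token budget.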

Yilmaz
NotAPhoenix
  • What if you were to shorten the cheat inputs such as `Circle` and `Left` to `C` and `L` respectively? – InsertCheesyLine May 22 '23 at 12:57
  • We need a minimal reproducible example. It's not clear what you are trying to do. If you want to generate large amounts of JSON, you probably want to generate it in chunks and then knit it together at the end. – Nath May 23 '23 at 06:26
  • LangChain doesn't allow you to exceed token limits. It compresses your data in such a way that the relevant parts are expressed in fewer tokens. What you can do is split the problem into multiple parts, e.g. only output 5 effects at a time, producing a JSON document each time, and then merge the JSON. – Nearoo May 23 '23 at 09:12

2 Answers

1

I had similar issues, having assumed the token limit would be enough for my tasks. While I don't have an exact recipe for you, I suggest the following:

  1. Break the problem into sub-problems. At first glance, you could generate the categories independently and then aggregate them in your code.
  2. Replace JSON with a more compact format for the subtasks. In my experience, JSON introduces a lot of token overhead. If your data can be represented as a table, CSV is more efficient.
  3. Limit the data the LLM has to receive or generate. Is it really necessary to get the category id from it?
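The three suggestions above could be combined roughly as follows. This is a sketch under assumptions, not a tested recipe: `generate_csv` is a hypothetical callable standing in for one LLM call per category, and `fake_generate_csv` returns canned text so the example runs without a model. The category ids are assigned in code rather than requested from the model.

```python
import csv
import io
import json


def parse_effects_csv(csv_text):
    """Turn 'effect,cheat' CSV text (with a header row) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    return [{"effect": row["effect"], "cheat": row["cheat"]} for row in reader]


def build_document(categories, generate_csv):
    """Request each category separately and assemble the final structure in code."""
    document = []
    for index, category in enumerate(categories, start=1):
        details = parse_effects_csv(generate_csv(category))
        # The id comes from the loop counter, so the model never has to produce it.
        document.append({"id": index, "category": category, "details": details})
    return document


def fake_generate_csv(category):
    # Hypothetical stand-in for a model call; a real version would send a prompt.
    canned = {
        "World effects": 'effect,cheat\nMoon gravity,"Left, Left, L1, R1, L1, Right, Left, L1, Left"\n',
        "Special Vehicles": "effect,cheat\nSpawn Kraken,1-999-282-2537 (BUBBLES)\n",
    }
    return canned[category]


document = build_document(["World effects", "Special Vehicles"], fake_generate_csv)
print(json.dumps(document, indent=4))
```

Each call then only has to fit one category's rows within the token limit, and the indentation is added after the fact by `json.dumps`.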
Bulat L
0
  • You can use map_reduce.

map_reduce splits the document into small chunks that each fit within the model's token limit. It summarizes every chunk independently and then combines those summaries. The downside is that you make more API calls, so it costs more. Also, some data may be lost when the summaries are combined.
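The control flow described above can be sketched in plain Python. This is an illustration, not LangChain's implementation: `summarize` is a hypothetical callable standing in for one model call, and chunks are cut by character count as a crude proxy for tokens. In LangChain releases from around that time, the built-in equivalent was, as far as I know, `load_summarize_chain(llm, chain_type="map_reduce")`.

```python
def split_into_chunks(text, max_chars):
    """Cut text into pieces of at most max_chars characters.

    Character count is only a crude stand-in for a real token count.
    """
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def map_reduce(text, summarize, max_chars):
    """Summarize each chunk independently, then summarize the joined summaries."""
    chunks = split_into_chunks(text, max_chars)
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize("\n".join(partial_summaries))              # reduce step


calls = []


def fake_summarize(text):
    # Hypothetical stand-in for one LLM call: record the input, keep 5 characters.
    calls.append(text)
    return text[:5]


result = map_reduce("abcdefghij" * 10, fake_summarize, max_chars=30)
print(len(calls), "calls ->", result)
```

With four chunks the sketch makes five calls: one per chunk plus one for the combining step. The map calls do not depend on each other, so they could be issued in parallel. The sketch does not handle the case where the joined summaries themselves exceed the limit.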

  • You can use the refine chain.

This makes a summary of the first chunk, adds that summary to the second chunk, and then makes an API call to summarize the combined text. The same step is repeated for each remaining chunk. This is how it works:

1. first_summary = Summary(first_chunk)
2. second_summary = Summary(first_summary + second_chunk)

With this method, you do not lose as much data as with map_reduce. It also makes many API calls, but this time the calls are not independent: each one needs the previous summary, so they cannot run in parallel and the method may take longer.
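The two numbered steps generalize to a loop over all chunks. Again this is a sketch rather than LangChain's own code: `summarize` is a hypothetical stand-in for one model call, and `fake_summarize` only exists to make the example runnable. In LangChain releases from around that time, the built-in equivalent was, as far as I know, `load_summarize_chain(llm, chain_type="refine")`.

```python
def refine(chunks, summarize):
    """Fold the chunks into one running summary, one call per chunk."""
    summary = summarize(chunks[0])
    for chunk in chunks[1:]:
        # Each call needs the previous summary, so the calls run sequentially.
        summary = summarize(summary + "\n" + chunk)
    return summary


refine_calls = []


def fake_summarize(text):
    # Hypothetical stand-in for one LLM call: record the input, flatten newlines.
    refine_calls.append(text)
    return text.replace("\n", " ")[:40]


final_summary = refine(["alpha", "beta", "gamma"], fake_summarize)
print(len(refine_calls), "calls ->", final_summary)
```

Each prompt has to hold both the running summary and the next chunk, so the chunk size needs to leave room for the summary within the token limit.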

Yilmaz