
For a bit of context, I recently started working on a personal project that accepts the URL of some recipe web page, pulls the HTML, converts the HTML to simplified markdown (this is the GPT-3 part), then sends that markdown to a thermal receipt printer in my kitchen, which prints it out.

Recipe web pages have a wide variety of structures, and they are notorious for including long and often irrelevant articles before the recipe, for the sake of SEO.

My plan was to use the fine-tuning API for davinci2, feeding it raw recipe HTML as input and cleaned, recipe-only markdown as output. I noticed, though, that the maximum token count for both training and inference is 4096. The HTML for a web page can be much larger than that, on the order of 20k tokens.

I am wondering if anyone has found a workaround for training and driving GPT-3 with more than 4096 tokens.

I'm open to other suggestions as well. For instance, I've considered passing just the visible text on the page rather than the full HTML tree, but there is much less context present in that form, and the model seems more easily confused by all of the links and other navigational elements on the page. I have also considered only allowing this project to accept "printer-friendly" versions of recipes, which tend to be much smaller and would easily come in under the 4096-token limit, but not all sites offer a printer-friendly version, and I don't want this to be a limitation.
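The visible-text idea can be sketched as follows. This is only an illustration, assuming BeautifulSoup (`pip install beautifulsoup4`); the specific tags stripped here are my guess at what carries no recipe content, not something from the question.

```python
from bs4 import BeautifulSoup

def visible_text(html: str) -> str:
    """Return only the human-visible text of a page, to cut the token count."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that generally never contain recipe content.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    # Collapse whitespace so the result is compact for the model.
    return " ".join(soup.get_text(separator=" ").split())
```

Even with this, the surviving text may still include sidebars and link text, which is the confusion problem mentioned above.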

Chris Doohan

2 Answers


This framework might be useful to you: https://github.com/Xpitfire/symbolicai

The basic idea is:

  1. Stream over your input data and build up a stack of chunks on the side.
  2. Next, your training procedure needs to account for loosely connected chunks of data. You can handle this by indexing or clustering the chunks before designing your prompts.
  3. When you want to create a query related to your long data stream, you can then search through your indexes and retrieve the related information.
  4. Now assemble a few-shot prompt with one section that relates to your query and another for the facts you want to include.
  5. Finally, feed that into your model along with examples of what you want the model to be tuned to.
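The steps above can be sketched in plain Python, without reproducing the symbolicai API (which I won't guess at here). The chunk size, the keyword-overlap "index", and the prompt template are all illustrative stand-ins; a real system would use embeddings for retrieval.

```python
def chunk(text: str, size: int = 500) -> list[str]:
    """Step 1: split a long input stream into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Steps 2-3: a naive index - rank chunks by keyword overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(query: str, facts: list[str]) -> str:
    """Step 4: assemble a prompt with a facts section and a query section."""
    return "Facts:\n" + "\n".join(facts) + f"\n\nQuestion: {query}\nAnswer:"
```

Each retrieved chunk stays under the model's context limit, which is the point of the whole exercise.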

I know this is explained at a fairly high level, but if you follow the link I provided, things might become clearer.


I don't know of any workarounds, but have you thought of filtering the HTML elements based on some basic rules? You could include only paragraph elements, or elements that have certain characteristics, like having a list within them, which is something most recipes have.
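A minimal sketch of that rule-based filtering, assuming BeautifulSoup; the choice of tags to keep is just one guess at a rule, not a tested heuristic.

```python
from bs4 import BeautifulSoup

def filter_recipe_html(html: str) -> str:
    """Keep only paragraphs and lists - most recipes live in <p>, <ul>, <ol>."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for el in soup.find_all(["p", "ul", "ol"]):
        kept.append(el.get_text(" ", strip=True))
    return "\n".join(kept)
```

The result is much smaller than the full page, at the cost of losing any recipe text that lives in other elements.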

edutuario
  • The hard part is finding a rule of thumb that works for a large percentage of sites. Looking over the recipes I've cooked over the past 3 months, they were sourced from 15 different websites. Someone else suggested doing an initial check to see if there's some schema markup, which is an interesting idea as well. – Chris Doohan Jul 30 '22 at 19:43