
We're trying to use GPT-3.5 or GPT-4 to classify data extracted from LinkedIn against a set of our own taxonomies. For example, we get a profile entry like this:

{
  "title": "Senior tender at bar",
  "description": "My job was to serve drinks and talk to loners",
  "company": "ABI (Awesome Bar Inc)"
}

and we have a (very long) list of taxonomies:

roles: [ "Mailman", "Tupperware salesman", "Bartender", "Pool boy", ... ]
companies: [ "USPS", "DHL", "Microsoft", "Google", "Awesome Bar", "Samsung", ... ]

And we need something to do the matching, something like:

{ ...experience, roleTaxonomy: "Bartender", companyTaxonomy: "Awesome Bar" }

We're using OpenAI's GPT API to do this, and it works reasonably well. Since it's only a suggestion system for the end user, we don't need it to be extremely accurate; it just has to work in a good percentage of cases.
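For reference, the call itself is nothing fancy. A minimal sketch of the kind of prompt we build (the helper name and the tiny taxonomy subset here are illustrative; in reality the role list alone has 400 entries):

```python
import json

def build_prompt(experience: dict, roles: list[str]) -> str:
    """Build a classification prompt asking the model to pick the closest role."""
    return (
        "Given this work experience:\n"
        f"{json.dumps(experience, indent=2)}\n\n"
        "Pick the single closest match from this list of roles, "
        "or answer 'none' if nothing fits:\n"
        + "\n".join(f"- {r}" for r in roles)
    )

experience = {
    "title": "Senior tender at bar",
    "description": "My job was to serve drinks and talk to loners",
    "company": "ABI (Awesome Bar Inc)",
}
roles = ["Mailman", "Tupperware salesman", "Bartender", "Pool boy"]
prompt = build_prompt(experience, roles)
# This string goes to the chat completions endpoint. Inlining all 400 roles
# (let alone 80k companies) this way is exactly what blows past the context limit.
```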

The problem? We have ~80k companies, ~50k universities, ~50k cities, 600 languages, and 400 roles. Since the API has no persistent session, we'd need to send the entire list with every request, and even sticking to a single taxonomy per call, we vastly exceed the permitted token count.

So, the question is: Has anyone worked with this sort of scenario before? Having a very large dataset to do a classification against it?

Note: We went with GPT because normal string-proximity algorithms wouldn't work; we need some context for the model to infer what we're talking about. There's also the matter of different languages, which GPT seems to handle very well.

Other note: We're not attached to any particular tech here; we can use an OpenAI substitute, a GPT substitute, or any sort of substitute for that matter.

Things we've tried so far:

  • Sending all taxonomies. This fails because the maximum token count is exceeded (by a very, very wide margin); no amount of taxonomy optimization seems likely to fix that.
  • Using tools like Stack-ai with vector stores, but it seems to filter the taxonomies in a strange way, giving us odd results.
  • Even using ChatGPT directly, but it also handles "big" datasets poorly. When I say big: if I send around 1,000 taxonomy entries, it can't make sense of them.
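What we suspect the vector-store tools are doing under the hood (and what we may need to do ourselves) is an embedding pre-filter: embed every taxonomy entry once, embed each incoming profile, and only send the top-k nearest candidates to GPT for the final pick. A sketch with toy 3-d vectors standing in for real embeddings (in practice these would come from an embedding model; the function name is ours):

```python
import numpy as np

def top_k_candidates(query_vec, taxonomy_vecs, taxonomy_labels, k=3):
    """Return the k taxonomy labels whose embeddings are most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    t = taxonomy_vecs / np.linalg.norm(taxonomy_vecs, axis=1, keepdims=True)
    sims = t @ q                      # cosine similarity of each label vs. the query
    best = np.argsort(-sims)[:k]      # indices of the k highest similarities
    return [taxonomy_labels[i] for i in best]

# Toy vectors standing in for precomputed embeddings of the role taxonomy.
labels = ["Mailman", "Tupperware salesman", "Bartender", "Pool boy"]
vecs = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.3, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
query = np.array([0.1, 0.95, 0.25])  # would be the embedding of the profile text
shortlist = top_k_candidates(query, vecs, labels, k=2)
# Only this shortlist goes into the GPT prompt, keeping it far under the token limit.
```

With 80k companies this would run against a vector index (FAISS, pgvector, etc.) rather than a dense numpy matrix, but the shape of the solution is the same: retrieval first, GPT only for the final disambiguation.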
