I'm using LLMs to classify products into specific categories (multi-class classification).
One way to do it would be to ask a yes/no question for a specific category and loop through all the categories.
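Roughly what I have in mind for the first option is something like the sketch below. The category names are placeholders, and I'm assuming the older `openai` Python package's `ChatCompletion` interface that shipped alongside gpt-3.5-turbo; adapt the call to whatever client you actually use.

```python
# Option 1 sketch: one yes/no call per category, looping over the label set.
# Assumes the openai==0.27-era ChatCompletion interface; categories are made up.
import openai

CATEGORIES = ["Electronics", "Home & Kitchen", "Toys"]  # hypothetical labels

def ask_yes_no(product_description: str, category: str) -> bool:
    prompt = (
        f"Does the following product belong to the category '{category}'?\n"
        f"Product: {product_description}\n"
        "Answer with exactly one word: yes or no."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"].strip().lower().startswith("yes")

def classify_by_looping(product_description: str) -> list[str]:
    # One API call per category, so cost scales linearly with the label set.
    return [c for c in CATEGORIES if ask_yes_no(product_description, c)]
```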
Another way would be to ask for the probability that a given product belongs to each of those classes.
The second option allows me to adjust the prediction thresholds in "post" and over- or under-classify certain classes.
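For the second option, this is roughly the shape I'm picturing: one call that returns a per-class probability, then adjustable per-class thresholds applied afterwards. The JSON-output prompt, the threshold values, and the category names are illustrative assumptions, not a tested recipe.

```python
# Option 2 sketch: ask for per-class probabilities in one call, then apply
# per-class thresholds "in post". Same openai ChatCompletion assumption as above.
import json
import openai

CATEGORIES = ["Electronics", "Home & Kitchen", "Toys"]  # hypothetical labels

def get_class_probabilities(product_description: str) -> dict[str, float]:
    prompt = (
        "For the product below, estimate the probability (0.0 to 1.0) that it "
        f"belongs to each of these categories: {', '.join(CATEGORIES)}.\n"
        f"Product: {product_description}\n"
        'Reply with JSON only, e.g. {"Electronics": 0.8, "Home & Kitchen": 0.1, "Toys": 0.05}.'
    )
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(resp["choices"][0]["message"]["content"])

# Per-class thresholds tuned after the fact: lower a threshold to
# over-classify that class, raise it to under-classify it.
THRESHOLDS = {"Electronics": 0.5, "Home & Kitchen": 0.6, "Toys": 0.4}

def classify_with_thresholds(product_description: str) -> list[str]:
    probs = get_class_probabilities(product_description)
    return [c for c, p in probs.items() if p >= THRESHOLDS.get(c, 0.5)]
```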
However, the word on the street is that RLHF-trained OpenAI models such as gpt-3.5-turbo and gpt-4 are weaker at estimating probabilities than text-completion models like text-davinci-003, because RLHF training makes the model "think" more like a human (and humans are bad at estimating probabilities).
Is there any literature I can read up on / should know about before I go ahead and run a hundred tests?
I haven't tried anything yet, since testing is time- and cost-intensive, and I'd like a baseline understanding of how to tackle the problem before starting.