
I am trying Scikit-LLM on a Stack Overflow question dataset of around 7k rows. Below is the code where I train and test a zero-shot classifier.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    _soQuestions['Body'], _soQuestions['isClosed'],
    test_size=0.33, random_state=42, stratify=_soQuestions['isClosed'])
#%%
from skllm import ZeroShotGPTClassifier

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")
clf.fit(X_train, y_train)
labels = clf.predict(X_test)

After half an hour, I received the following error, but I have no idea how to split the dataset (or the individual questions) into chunks of an appropriate size.

Could not obtain the completion after 3 retries: InvalidRequestError :: This model's maximum context length is 4097 tokens. However, your messages resulted in 4438 tokens. Please reduce the length of the messages.

I appreciate any advice.

renakre
  • Not a solution, but a hint: you can use OpenAI's `tiktoken` library (`enc = tiktoken.encoding_for_model("gpt-3.5-turbo")`, then `len(enc.encode(text))`) to count the tokens in your string and check whether they exceed the limit. Use it with `pandas`' `apply()` method. This helps you find the row which causes the error. Then you can inspect the row and think of ways to handle it – DataJanitor Jul 12 '23 at 09:00

0 Answers