You need to use streaming to receive the response in real time as it is generated, instead of waiting for the whole completion to be computed and returned to you.
With LangChain, you can enable streaming like below:
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.schema import HumanMessage

# streaming=True makes the model emit tokens as they are generated;
# StreamingStdOutCallbackHandler prints each token to stdout as it arrives
chat = ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0)
resp = chat([HumanMessage(content="I need help!")])
With StreamingStdOutCallbackHandler(), the streamed tokens are written to standard output. You can write your own callback handler by extending BaseCallbackHandler and overriding the on_llm_new_token(self, token: str, **kwargs: Any) method to do something else with the streamed tokens instead of printing them to standard output. See the following example:
from typing import Any

from langchain.callbacks.base import BaseCallbackHandler

class MyCustomStreamingCallbackHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        # do your things with each new token instead of printing it to stdout
        ...
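For instance, here is a minimal sketch of a custom handler that gathers the streamed tokens into a list instead of printing them, wired into the same ChatOpenAI setup as above (the CollectTokensHandler name and the list-collecting behaviour are just illustrative choices):

from typing import Any

from langchain.callbacks.base import BaseCallbackHandler
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

class CollectTokensHandler(BaseCallbackHandler):
    def __init__(self) -> None:
        self.tokens = []

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        # collect each token as it streams in instead of writing it to stdout
        self.tokens.append(token)

handler = CollectTokensHandler()
chat = ChatOpenAI(streaming=True, callbacks=[handler], temperature=0)
chat([HumanMessage(content="I need help!")])
print("".join(handler.tokens))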
Explanation:
When you request a completion from OpenAI without streaming, the entire completion is generated first and then transmitted back as a single response. If you are generating lengthy completions, the wait for the response can be significant, often lasting several seconds.
Let's say we are using the chat completions API with the gpt-3.5-turbo model and send the following request:
{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "I need help!"
    }
  ]
}
and we have the following response:
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\n\nOf course!, how may I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}
The problem is that this response is only returned after the whole completion has been computed and all tokens merged together.
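For reference, a call that returns such a single, fully-merged response could look like the following minimal sketch (assuming the legacy openai Python package, pre-1.0, with OPENAI_API_KEY set in the environment):

import openai  # legacy openai package (< 1.0); expects OPENAI_API_KEY in the environment

# without stream=True, this call blocks until the whole completion is generated
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "I need help!"}],
)
print(response["choices"][0]["message"]["content"])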
Now let's see another sample response with stream enabled:
{
  "choices": [
    {
      "delta": {
        "role": "assistant"
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1677825464,
  "id": "chatcmpl-Bb0AYGSwYyHT3dmvkS9Lds",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
{
  "choices": [
    {
      "delta": {
        "content": "\n\nOf"
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1677825464,
  "id": "chatcmpl-rl47iois4OFY52AP1K3dSs",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
{
  "choices": [
    {
      "delta": {
        "content": "course!,"
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1677825464,
  "id": "chatcmpl-pmRtrrxTWwLIk2IY7KRosc",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
...
{
  "choices": [
    {
      "delta": {},
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "created": 1677825464,
  "id": "chatcmpl-2qHi40hwz0cuaBxGXHeV1a",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
As you can see, with streaming you receive the response in real time as it is being generated, instead of waiting for the whole completion to be computed, merged, and sent back to you.
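If you want to consume such a stream directly with the OpenAI client rather than through LangChain, a minimal sketch (again assuming the legacy openai Python package, pre-1.0, with OPENAI_API_KEY set in the environment) could look like this:

import openai  # legacy openai package (< 1.0); expects OPENAI_API_KEY in the environment

# with stream=True the call returns an iterator of chat.completion.chunk objects
chunks = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "I need help!"}],
    stream=True,
)

for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    # the first chunk only carries the role; later chunks carry the content tokens
    if "content" in delta:
        print(delta["content"], end="", flush=True)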