
I am currently running a QA model using load_qa_with_sources_chain(). However, when I run it with three chunks of up to 10,000 tokens each, it takes about 35 seconds to return an answer. I would like to speed this up.

Can somebody explain what influences the speed of this function and whether there is any way to reduce the response time? If that's not possible, what other changes could I make to speed up QA with sources?

I tried changing the size of the text chunks, but that did not have a significant effect. I am using the map_reduce chain type and Python 3.10.
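
For reference, a stripped-down version of what I am running looks roughly like this (the document contents and the question are placeholders; in reality each of the three documents is up to ~10,000 tokens):

from langchain.chains import load_qa_with_sources_chain
from langchain.chat_models import ChatOpenAI
from langchain.schema import Document

# map_reduce QA-with-sources chain over a few large chunks
llm = ChatOpenAI(temperature=0)
chain = load_qa_with_sources_chain(llm, chain_type="map_reduce")

docs = [
    Document(page_content="...chunk text...", metadata={"source": "doc-1"}),
    Document(page_content="...chunk text...", metadata={"source": "doc-2"}),
    Document(page_content="...chunk text...", metadata={"source": "doc-3"}),
]

result = chain({"input_documents": docs, "question": "What does the report conclude?"})
print(result["output_text"])  # this call is what takes ~35 seconds with the real chunks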

derlunter

2 Answers


You need to use streaming to get the response in real time as it is computed, instead of waiting for the whole completion to be generated and returned to you.

With LangChain, you can enable streaming like below:

from langchain.chat_models import ChatOpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.schema import HumanMessage

chat = ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0)
resp = chat([HumanMessage(content="I need help!")])

With StreamingStdOutCallbackHandler(), the streamed tokens are written to standard output. You can write your own callback handler by extending BaseCallbackHandler and overriding the on_llm_new_token(self, token: str, **kwargs: Any) method to do something else with the streamed tokens instead of printing them. See the following example:

from typing import Any

from langchain.callbacks.base import BaseCallbackHandler

class MyCustomStreamingCallbackHandler(BaseCallbackHandler):
    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        # Do your things with each streamed token here instead of printing to stdout,
        # e.g. push it over a websocket or append it to a buffer.
        ...
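
To tie this back to the question, a rough sketch (untested; docs and query stand in for your own chunks and question) of plugging such a handler into load_qa_with_sources_chain could look like this:

from langchain.chains import load_qa_with_sources_chain
from langchain.chat_models import ChatOpenAI

# Reuse the handler defined above; docs and query are placeholders for your own data.
llm = ChatOpenAI(
    streaming=True,
    callbacks=[MyCustomStreamingCallbackHandler()],
    temperature=0,
)
chain = load_qa_with_sources_chain(llm, chain_type="map_reduce")
result = chain({"input_documents": docs, "question": query})

Note that streaming does not make the chain finish sooner; it lets you start showing tokens before the final answer is complete, which is usually what makes it feel faster.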

Explanation:

When requesting a completion from OpenAI without streaming, the whole completion is generated before it is transmitted back as a single response. If you are generating lengthy completions, waiting for that response can take a long time, often many seconds.

Let's say we are using the chat completions API with the gpt-3.5-turbo model and send the following request:

{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "I need help!"
    }
  ]
}

and we have the following response:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "\n\nOf course!, how may I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}

The problem is that this response is fully computed first, all tokens are assembled, and only then is everything returned at once.

Now let's see another sample response with stream enabled:

{
  "choices": [
    {
      "delta": {
        "role": "assistant"
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1677825464,
  "id": "chatcmpl-Bb0AYGSwYyHT3dmvkS9Lds",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
{
  "choices": [
    {
      "delta": {
        "content": "\n\nOf"
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1677825464,
  "id": "chatcmpl-rl47iois4OFY52AP1K3dSs",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
{
  "choices": [
    {
      "delta": {
        "content": "course!,"
      },
      "finish_reason": null,
      "index": 0
    }
  ],
  "created": 1677825464,
  "id": "chatcmpl-pmRtrrxTWwLIk2IY7KRosc",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}
...

{
  "choices": [
    {
      "delta": {},
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "created": 1677825464,
  "id": "chatcmpl-2qHi40hwz0cuaBxGXHeV1a",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion.chunk"
}

As you can see, with streaming you receive the response in real time as it is computed, instead of waiting for the whole completion to be generated, merged, and sent back to you.
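
For completeness, here is a minimal sketch of consuming such a stream directly with the openai Python package (0.x-style API, as used with gpt-3.5-turbo at the time; it assumes OPENAI_API_KEY is set in the environment):

import openai

# Request a streamed chat completion and print each token as soon as it arrives.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "I need help!"}],
    stream=True,
)

for chunk in response:
    content = chunk["choices"][0]["delta"].get("content")
    if content:
        print(content, end="", flush=True)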


My understanding is that this is a general issue at the moment.
See https://github.com/hwchase17/langchain/issues/1702.
It would be great to learn whether there are ways to make it "less slow" for now.

Clemens