I need a clear explanation of the parameters of `PromptHelper` in LlamaIndex. Unfortunately, the documentation on this is very short. I would like to understand the actual role each parameter plays when the index is created.

My goal is to reduce the number of tokens used when querying the index, without affecting the quality of the output too much. I thought that reducing `context_window` would reduce token usage, but it did not. On the same document I tried `context_window=4096` first and `context_window=2000` afterwards; asking the same question, the number of tokens used was identical.

Does anyone have a clear idea of the role each parameter plays in creating the index? If so, do you also have any suggestions on how to reduce the tokens used during querying (not creation) of the index?
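
To make the question concrete, here is how I currently read the three parameters; the comments are my assumptions, not statements from the docs:

```python
from llama_index import PromptHelper

prompt_helper = PromptHelper(
    context_window=4096,      # assumption: total token budget of the LLM; retrieved
                              # text is packed so that prompt + answer fit inside it
    num_output=256,           # assumption: tokens reserved for the answer, subtracted
                              # from the window before chunks are packed in
    chunk_overlap_ratio=0.2,  # assumption: overlap between consecutive chunks when
                              # text has to be re-split to fit the window
)
```

If that reading is correct, it would explain my observation: as long as the retrieved chunks already fit, shrinking `context_window` changes nothing about the prompt that is actually sent.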

Here is an example of my code:

```python
import gcsfs
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from llama_index import GPTVectorStoreIndex, ServiceContext, LLMPredictor, PromptHelper, StorageContext, load_index_from_storage, LangchainEmbedding
from llama_index.vector_stores import DeepLakeVectorStore

openai_api_key = 'xxx'
fs_gcs = gcsfs.GCSFileSystem()  # filesystem handle for GCS persistence (ambient credentials)

context_window = 4096 # or 2000
num_output = 256
chunk_overlap_ratio = 0.2
completion_model_name = "gpt-3.5-turbo"
embed_model_name = "text-embedding-ada-002"

llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name=completion_model_name, openai_api_key=openai_api_key))

prompt_helper = PromptHelper(context_window=context_window, num_output=num_output, chunk_overlap_ratio=chunk_overlap_ratio)

embed_model = LangchainEmbedding(OpenAIEmbeddings(openai_api_key=openai_api_key,model=embed_model_name))

service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, embed_model=embed_model)

dataset_path = 'dataset_path'

try:
    vector_store = DeepLakeVectorStore(dataset_path=f"gcs://{dataset_path}", overwrite=False, read_only=True)
    storage_context = StorageContext.from_defaults(vector_store=vector_store, persist_dir=dataset_path, fs=fs_gcs)
    index = load_index_from_storage(storage_context=storage_context, service_context=service_context)
except FileNotFoundError:
    vector_store = DeepLakeVectorStore(dataset_path=f"gcs://{dataset_path}", overwrite=True)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    # `documents` is the list of Document objects loaded elsewhere,
    # e.g. with SimpleDirectoryReader(...).load_data()
    index = GPTVectorStoreIndex.from_documents(documents, storage_context=storage_context,
                                               service_context=service_context)
    index.storage_context.persist(persist_dir=dataset_path, fs=fs_gcs)

query_engine = index.as_query_engine(service_context=service_context)
response = query_engine.query('what is this document about?')

print(f'tokens used for the query: {llm_predictor.last_token_usage}')
```
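
One thing I have considered, based on reading the source rather than the docs (so treat it as a guess), is retrieving fewer chunks per query, since fewer chunks should mean a smaller prompt regardless of `context_window`:

```python
# Assumption: similarity_top_k controls how many chunks are retrieved and
# packed into the prompt; lowering it should lower query-time token usage.
query_engine = index.as_query_engine(
    service_context=service_context,
    similarity_top_k=1,  # default is 2
)
response = query_engine.query('what is this document about?')
```

I don't know whether this is the intended mechanism, though, or how much it degrades answer quality.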