
I am playing with LangChain, OpenAI, and Pinecone (a vector DB).

I have generated a random list of 16 toys in total. Each toy is on its own line with a short description:

  1. LEGO sets: LEGO offers a wide range of building sets, including themed sets based on movies, superheroes, and more.
  2. Barbie dolls: These iconic fashion dolls have been popular for decades, and they come in various themes and styles.
  3. Nerf blasters: Nerf guns and blasters are foam-based toys that allow children (and adults) to have safe and fun mock battles.
  4. Hatchimals: These interactive toys come in eggs and "hatch" to reveal surprise characters that kids can nurture and play with.

My goal was to feed this list to Pinecone and query it with questions.

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import VectorDBQA
from langchain.llms import OpenAI

loader = TextLoader("toys.txt")
document = loader.load()

# chunk_size=1 with the default "\n\n" separator leaves the whole file
# as a single oversized chunk (hence the single vector in the index)
text_splitter = CharacterTextSplitter(chunk_size=1, chunk_overlap=0)
texts = text_splitter.split_documents(document)

embeddings = OpenAIEmbeddings(openai_api_key="xxxxxxx")
docsearch = Pinecone.from_documents(texts, embeddings, index_name="toys")

qa = VectorDBQA.from_chain_type(
    llm=OpenAI(openai_api_key="xxxx"),
    chain_type="stuff",
    vectorstore=docsearch,
    return_source_documents=True,
)
query = "how many toys do you have"
result = qa({"query": query})

It always ends up with a different number: sometimes 10, 13, 8, 16... In total I have 16 toys in my list, so it actually fails quite often.

I wonder if I can improve this by putting more information into the prompt. I tried adding a description: "This is a list of toys; each number is the ID of a toy, followed by its name and a description."

If this setup already fails in such a simple case, I wonder how precise it can be with larger files or data. Currently I have only one vector in my database, because the list is not big.

Khan
  • Since you have 1 vector in your database, the retriever will simply return that chunk every time. You can simplify the issue outside the scope of VectorDBQA and put that content in the prompt manually. What is the token size of the chunk? – Joost Döbken Aug 01 '23 at 09:04
  • Hi, yes, you are right, and the setup does not make much sense at the moment. But I actually plan to put more data into the vector DB. Since I wanted to test it with a single chunk and it failed already :D, I am wondering what I am doing wrong. But I guess it's just a limitation of the language model, because it also did not work correctly when I copy-pasted the chunk as text into ChatGPT. I guess I need some reasoning examples so the model better understands how to proceed... I have not tried it out, but I have researched it. – Khan Aug 01 '23 at 20:48
  • 2
    Also you need to understand that the retriever only returns a couple of chunks based on semantic similarity with the question; it can only answer questions about specific details in the text. To ask questions to the entire text, such as getting a summary or counting the number of toys, you should feed ALL chunks (without overlap) to something such as a [MapReduce chain](https://python.langchain.com/docs/modules/chains/document/map_reduce). – Joost Döbken Aug 02 '23 at 07:21
  • Yes, I understood; thank you for this. It will be beneficial when I have a larger text and more chunks. But as I mentioned, I currently have only one chunk. There is no other chunk, so it should actually retrieve the one and only chunk, and it still returns different results. – Khan Aug 09 '23 at 10:21
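To make the MapReduce suggestion from the comments concrete, here is a plain-Python sketch of the pattern (not LangChain's actual API; line-counting stands in for the per-chunk LLM call the real chain would make):

```python
# Hand-rolled map/reduce over chunks, mirroring what a MapReduce chain does:
# get a partial answer per chunk, then combine the partials into one answer.
chunks = [
    "1. LEGO sets: themed building sets.\n2. Barbie dolls: iconic fashion dolls.",
    "3. Nerf blasters: foam-based toys.\n4. Hatchimals: interactive hatching eggs.",
]

# "Map" step: one partial result per chunk (an LLM call per chunk in the real chain).
partial_counts = [len(chunk.splitlines()) for chunk in chunks]

# "Reduce" step: combine the partial answers into the final one.
total = sum(partial_counts)
print(total)  # 4
```

With a single chunk the "reduce" step is trivial, which is why the pattern only pays off once the text no longer fits into one prompt.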
