I am playing with LangChain, OpenAI, and Pinecone (a vector DB).
I generated a random list of 16 toys in total; each toy is on its own row with a short description, e.g.:
- LEGO sets: LEGO offers a wide range of building sets, including themed sets based on movies, superheroes, and more.
- Barbie dolls: These iconic fashion dolls have been popular for decades, and they come in various themes and styles.
- Nerf blasters: Nerf guns and blasters are foam-based toys that allow children (and adults) to have safe and fun mock battles.
- Hatchimals: These interactive toys come in eggs and "hatch" to reveal surprise characters that kids can nurture and play with.
My goal was to feed this list into Pinecone and query it with questions:
```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import VectorDBQA
from langchain.llms import OpenAI

# Load the toy list and split it into chunks.
loader = TextLoader("toys.txt")
document = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1, chunk_overlap=0)
texts = text_splitter.split_documents(document)

# Embed the chunks and index them in Pinecone
# (pinecone.init(...) was called beforehand).
embeddings = OpenAIEmbeddings(openai_api_key="xxxxxxx")
docsearch = Pinecone.from_documents(texts, embeddings, index_name="toys")

# Build a retrieval QA chain over the index.
qa = VectorDBQA.from_chain_type(
    llm=OpenAI(openai_api_key="xxxx"),
    chain_type="stuff",
    vectorstore=docsearch,
    return_source_documents=True,
)
query = "how many toys do you have"
result = qa({"query": query})
```
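As far as I understand the splitter, this may be why I only get one vector: a character splitter cuts on a separator and then merges pieces up to `chunk_size`, and the default separator is a blank line (`"\n\n"`), so a file with single-newline rows is never split at all. A minimal plain-Python sketch of that behavior (my own simplified model, not LangChain's actual implementation):

```python
# Simplified sketch of a character-based splitter: split on a separator,
# then merge pieces back together up to chunk_size. With chunk_size=1,
# every separated piece becomes its own chunk.
def split_text(text: str, separator: str, chunk_size: int) -> list[str]:
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = (current + separator + piece) if current else piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Toy rows separated by single newlines, like my toys.txt.
toys = "- LEGO sets: ...\n- Barbie dolls: ...\n- Nerf blasters: ...\n- Hatchimals: ..."

# Default separator "\n\n": nothing to split on -> the whole file is one chunk.
print(len(split_text(toys, "\n\n", chunk_size=1)))  # 1
# separator="\n" with chunk_size=1: one chunk per toy row.
print(len(split_text(toys, "\n", chunk_size=1)))    # 4
```

So passing `separator="\n"` to `CharacterTextSplitter` should give one chunk (and one vector) per toy row instead of one vector for the whole file.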
It always ends up with a different number: sometimes 10, 13, 8, 16... In total I have 16 toys in my list, so it actually fails quite often.
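My guess is that this is inherent to the retrieval step: the QA chain fetches only the top-k most similar chunks (k is small by default) and "stuffs" just those into the LLM prompt, so the model never sees the full list and has to guess at the total. A toy sketch of top-k similarity retrieval (made-up 2-D vectors, not real embeddings):

```python
# Sketch of why aggregate questions like "how many toys" are unreliable:
# only the k most similar chunks reach the prompt, never the whole list.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Pretend embeddings for 16 toy descriptions (2-D just for illustration).
store = {f"toy {i}": (math.cos(i), math.sin(i)) for i in range(16)}

def top_k(query_vec, k=4):
    ranked = sorted(store, key=lambda d: cosine(query_vec, store[d]), reverse=True)
    return ranked[:k]

context = top_k((1.0, 0.0))  # only 4 of the 16 docs reach the prompt
print(len(context))          # 4
```

If only 4 of 16 chunks are stuffed into the prompt, the LLM cannot count the list correctly no matter how it is worded.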
I wonder if I can improve this by putting more information into the prompt. I tried adding a description: "This is a list of toys; each number represents the ID of a toy, followed by its name and a description."
If this setup already fails with such a simple case, I wonder how precisely it can work with larger files or data. Currently I have only 1 vector in my database, because the list is not big.