
When I use the following code, which summarizes long PDFs, it works fine for the first PDF. But if I run it on a second PDF (that is, I change the file path to another PDF), it still outputs the summary of the first PDF, as if the embeddings from the first PDF (the previous run) were stored somewhere and never deleted.

from langchain.document_loaders import PyPDFLoader # for loading the pdf
from langchain.embeddings import OpenAIEmbeddings # for creating embeddings
from langchain.vectorstores import Chroma # for the vectorization part
from langchain.chains import ChatVectorDBChain # for chatting with the pdf
from langchain.llms import OpenAI # the LLM model we'll use (ChatGPT)
import os

os.environ["OPENAI_API_KEY"] = "my_API_KEY"

pdf_path = "file_path"
loader = PyPDFLoader(pdf_path)
pages = loader.load_and_split()
print(pages[1].page_content)

embeddings = OpenAIEmbeddings()
vectordb = Chroma.from_documents(pages, embedding=embeddings,
                                 persist_directory=".")

vectordb.persist()


pdf_qa = ChatVectorDBChain.from_llm(OpenAI(temperature=0.9, model_name="gpt-3.5-turbo"),
                                    vectordb, return_source_documents=True)


query = "Write a summary of the text."
# chat_history is a list of (question, answer) pairs; empty for the first turn
result = pdf_qa({"question": query, "chat_history": []})
print(result["answer"])

This behavior persists even after restarting Python, and with a number of other PDFs. I started renaming all the objects, and sometimes this helps. But right now, even after renaming all the objects, it still outputs the summary of the previous PDF. I am so confused by this behavior.

Any clue how I can delete the vectors from the previous round or fix this?

TylerH
rna_2090

1 Answer


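Change your persist directory so that it's different for each PDF. With persist_directory=".", every run writes to and reads from the same on-disk store, so the vectors from earlier PDFs are still in there.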
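One way to get a distinct directory per PDF is to derive it from the file path. This is only a minimal sketch: the helper name persist_dir_for and the root folder chroma_stores are made up for illustration and are not part of LangChain or Chroma. Any string works as a directory name, as long as two different PDFs never end up with the same one.

```python
import hashlib
from pathlib import Path


def persist_dir_for(pdf_path: str, root: str = "chroma_stores") -> str:
    """Return a directory path that is unique to the given PDF path.

    Hypothetical helper: the name and the default root are illustrative.
    """
    # Hash the absolute path so two files with the same name in
    # different folders still get different directories.
    resolved = str(Path(pdf_path).resolve())
    digest = hashlib.sha256(resolved.encode("utf-8")).hexdigest()[:16]
    return str(Path(root) / f"{Path(pdf_path).stem}-{digest}")


# Each PDF maps to its own store, so vectors from one file
# are not mixed into the store used for another.
dir_a = persist_dir_for("first.pdf")
dir_b = persist_dir_for("second.pdf")

# In the question's code, this would replace persist_directory=".":
# vectordb = Chroma.from_documents(pages, embedding=embeddings,
#                                  persist_directory=persist_dir_for(pdf_path))
```

Because the directory is derived from the path, re-running the script on the same PDF points at the same store again, while switching pdf_path to another file points at a different one.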

rbakhru
  • hey, thanks for your reply! are you saying to change it to any string? doesn't matter what combination of characters? – rna_2090 May 07 '23 at 17:28