I am trying to get a LangChain application to query a document that contains different types of information. To make the downstream processing easier, I want the response in a specific format, so I am using Pydantic to structure the data as I need, but I am running into an issue.
Sometimes ChatGPT doesn't respect the format defined by my Pydantic schema, so an exception is raised and my program stops. Sure, I can handle the exception, but I would much rather have ChatGPT respect the format, and I wonder if I am doing something wrong.
More specifically:
- The date is not formatted correctly: ChatGPT returns the date exactly as it appears in the document, not as a datetime.date.
- The Enum field from Pydantic also doesn't work well: some documents say Lastname rather than Surname, and ChatGPT returns Lastname verbatim instead of mapping it to Surname.
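One workaround I have been considering for both points is normalizing the model's raw output before Pydantic rejects it. This is a minimal standalone sketch (no LangChain needed, Pydantic v1-style validators); the accepted date formats and the Lastname-to-Surname mapping are my assumptions about what ChatGPT tends to return:

```python
import datetime
from enum import Enum

from pydantic import BaseModel, Field, validator


class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'


class DocumentSchema(BaseModel):
    date: datetime.date = Field(..., description='The date of the doc')
    name: NameEnum = Field(..., description='Is it name or surname?')

    @validator('date', pre=True)
    def parse_date(cls, value):
        # Try a few common textual date formats (assumed, adjust to
        # your documents) before falling back to default parsing.
        if isinstance(value, str):
            for fmt in ('%Y-%m-%d', '%d/%m/%Y', '%B %d, %Y'):
                try:
                    return datetime.datetime.strptime(value, fmt).date()
                except ValueError:
                    continue
        return value

    @validator('name', pre=True)
    def map_synonyms(cls, value):
        # Map the synonym found in some documents onto the enum value.
        if isinstance(value, str) and value.strip().lower() == 'lastname':
            return 'Surname'
        return value
```

With this, `DocumentSchema(date='31/12/2020', name='Lastname')` validates instead of raising, but it feels like I am patching over the model's behaviour rather than fixing it.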
Lastly, I am not sure I am using the chains correctly, because I keep getting confused by all the different examples in the LangChain documentation.
After loading all the necessary packages, this is the code I have:
FILE_PATH = 'foo.pdf'

class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'

class DocumentSchema(BaseModel):
    date: datetime.date = Field(..., description='The date of the doc')
    name: NameEnum = Field(..., description='Is it name or surname?')

def main():
    loader = PyPDFLoader(FILE_PATH)
    data = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=10)
    all_splits = text_splitter.split_documents(data)

    vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
    llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

    question = """What is the date on the document?
    Is it about a name or surname?
    """

    doc_prompt = PromptTemplate(
        template="Content: {page_content}\nSource: {source}",
        input_variables=["page_content", "source"],
    )

    prompt_messages = [
        SystemMessage(
            content=(
                "You are a world class algorithm for extracting information in structured formats."
            )
        ),
        HumanMessage(content="Answer the questions using the following context"),
        HumanMessagePromptTemplate.from_template("{context}"),
        HumanMessagePromptTemplate.from_template("Question: {question}"),
        HumanMessage(content="Tips: Make sure to answer in the correct format"),
    ]

    chain_prompt = ChatPromptTemplate(messages=prompt_messages)

    chain = create_structured_output_chain(output_schema=DocumentSchema, llm=llm, prompt=chain_prompt)
    final_qa_chain_pydantic = StuffDocumentsChain(
        llm_chain=chain,
        document_variable_name="context",
        document_prompt=doc_prompt,
    )
    retrieval_qa_pydantic = RetrievalQA(
        retriever=vectorstore.as_retriever(), combine_documents_chain=final_qa_chain_pydantic
    )

    data = retrieval_qa_pydantic.run(question)
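For now, the best I can do is contain the failure, as mentioned above. This sketch uses a plain pydantic.ValidationError as a stand-in for the exception the chain wraps (the helper name `parse_or_none` is mine, not a LangChain API):

```python
import datetime
from enum import Enum

from pydantic import BaseModel, ValidationError


class NameEnum(Enum):
    Name = 'Name'
    Surname = 'Surname'


class DocumentSchema(BaseModel):
    date: datetime.date
    name: NameEnum


def parse_or_none(raw: dict):
    """Return a DocumentSchema, or None when the model output is malformed."""
    try:
        return DocumentSchema(**raw)
    except ValidationError:
        return None
```

This keeps the program running when the model's answer does not match the schema, but it discards the document instead of extracting it, which is why I would rather get the format right in the first place.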
Depending on the file I am checking, running the script raises an error because ChatGPT's output does not respect the formats required by the Pydantic schema.
What am I missing here?
Thank you!