I am writing a custom QA chatbot using LangChain, Chroma, and the OpenAI GPT API. Below is the function I use to instantiate a persisted database for my vectors:
```python
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

def create_db(pdf_file):
    all_splits = split_and_clean_text(pdf_file)
    # db_loc (the persist path) is defined elsewhere in my script
    Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings(), persist_directory=db_loc)
```
It calls the function split_and_clean_text, shown below, which lowercases all characters, removes punctuation and special characters, and strips English stop words:
```python
import re
from nltk.corpus import stopwords
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_and_clean_text(pdf_file):
    loader = PyPDFLoader(pdf_file)
    pages = loader.load_and_split()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200, length_function=len)
    all_splits = text_splitter.split_documents(pages)
    stop = stopwords.words('english')
    for split in all_splits:
        # Lowercase, strip punctuation/special characters/URLs, then drop stop words
        split.page_content = split.page_content.lower()
        split.page_content = re.sub(r"(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", split.page_content)
        split.page_content = " ".join([word for word in split.page_content.split() if word not in stop])
    return all_splits
```
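To make the effect of the cleaning concrete, here is what those three steps do to a sample sentence (just an illustration, not part of my actual pipeline):

```python
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
text = "The QUICK brown fox (it jumped!) over the lazy dog."
text = text.lower()
text = re.sub(r"(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)
text = " ".join(word for word in text.split() if word not in stop)
print(text)  # quick brown fox jumped lazy dog
```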
I also have a function, retrieve_db, that returns the persisted database, shown below:
```python
embedding = OpenAIEmbeddings()

def retrieve_db():
    # Reload the persisted Chroma collection from db_loc
    return Chroma(persist_directory=db_loc, embedding_function=embedding)
```
Finally, I have my function start_qa, shown below:
```python
from langchain.prompts import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferMemory

def start_qa():
    template = """
Use the following context (delimited by <ctx></ctx>) and the chat history (delimited by <hs></hs>) to answer the question, if you can. Also, provide a direct quote from the text that supports your answer. If you can't answer the question using the information provided, just say "Based on the text provided, I do not know.":
------
<ctx>
{context}
</ctx>
------
<hs>
{history}
</hs>
------
{question}
Answer:
"""
    prompt = PromptTemplate(
        input_variables=["history", "context", "question"],
        template=template
    )
    vector_db = retrieve_db()
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_db.as_retriever(),
        verbose=True,
        chain_type_kwargs={
            "prompt": prompt,
            # Keep chat history under the "history" key the prompt expects
            "memory": ConversationBufferMemory(
                memory_key="history",
                input_key="question"),
        }
    )
    while True:
        question = input("Ask a question (q to quit): ")
        if question.strip().lower() == "q":
            break
        result = qa_chain({"query": question})
        print(result['result'])
```
The issue I am having is that I want my QA bot to search over the embedded (cleaned) text, but then pass the original, uncleaned text into the prompt context to improve clarity. Everything needs to be stored in the persisted database, but I can't figure out how to modify my database to also store the original uncleaned text, nor how to specify that the search should run over the cleaned, embedded text while returning the raw text.
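The closest idea I've had is to stash the original chunk text in each split's metadata before cleaning, so the cleaned text is what gets embedded and searched while the raw text still lives in the same persisted collection, and then swap it back after retrieval. An untested sketch of what I mean is below (`create_db_with_raw`, `retrieve_raw_context`, and the `raw_text` metadata key are just placeholder names I made up):

```python
def create_db_with_raw(pdf_file):
    loader = PyPDFLoader(pdf_file)
    pages = loader.load_and_split()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200, length_function=len)
    all_splits = text_splitter.split_documents(pages)
    stop = stopwords.words('english')
    for split in all_splits:
        split.metadata["raw_text"] = split.page_content  # keep the original, uncleaned chunk
        cleaned = split.page_content.lower()
        cleaned = re.sub(r"(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", cleaned)
        split.page_content = " ".join(w for w in cleaned.split() if w not in stop)
    Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings(), persist_directory=db_loc)

def retrieve_raw_context(query, k=4):
    # Similarity search runs over the embedded (cleaned) page_content,
    # then the raw text is swapped back in before the docs reach the prompt
    docs = retrieve_db().similarity_search(query, k=k)
    for doc in docs:
        doc.page_content = doc.metadata.get("raw_text", doc.page_content)
    return docs
```

But I don't see how to plug that swap into RetrievalQA, since as_retriever() hands the cleaned page_content straight to the {context} slot of the prompt. Is metadata the right place to store the raw text, and if so, how do I hook the swap into the chain?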