Langchain: different knowledge depending on language

Question

I'm trying to build a chatbot with domain-specific knowledge (in particular, real estate in Switzerland). I created a chatbot that I feed with information from a PDF, and I run it with a memory function. It works pretty well, even in multiple languages. So I was curious whether the chatbot's knowledge is limited to the custom knowledge, or whether it also has pre-trained knowledge from the model. I first asked some domain-specific questions (in English), which were all answered correctly. Then I asked some general-knowledge questions, to which the chatbot answered "I don't know". So I concluded there is no "outside" knowledge. Then I randomly asked the same question in German ("what's the capital of Switzerland?"), and suddenly it knew the correct answer.

Is this normal behaviour or is this some kind of bug?
Is there a way I can tell the chatbot to focus only on the custom knowledge/to include pre-trained general knowledge?

I couldn't find anything related to this in the LangChain documentation.

Here is the code I'm using:

import os
import pandas as pd
import matplotlib.pyplot as plt
from transformers import GPT2TokenizerFast
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
from IPython.display import display
import ipywidgets as widgets

os.environ["OPENAI_API_KEY"] = "..."

# STEP 1: Split by chunk

# Convert PDF to text
import textract
doc = textract.process("./Allgemeine Bedingungen.pdf")

# Save to .txt and reopen (explicit UTF-8 so German umlauts survive on any platform)
with open('Allgemeine Bedingungen.txt', 'w', encoding='utf-8') as f:
    f.write(doc.decode('utf-8'))

with open('Allgemeine Bedingungen.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Create function to count tokens
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 512,
    chunk_overlap  = 24,
    length_function = count_tokens,
)

chunks = text_splitter.create_documents([text])

# STEP 2: Embed text and store embeddings

# Get embedding model
embeddings = OpenAIEmbeddings()

# Create vector database
db = FAISS.from_documents(chunks, embeddings)

# STEP 3: Setup retrieval function

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")

query = "Was ist die Unterhaltspflicht des Mieters?"
docs = db.similarity_search(query)

chain.run(input_documents=docs, question=query)

# STEP 4: Create chatbot with chat memory

qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())

chat_history = []

def on_submit(_):
    query = input_box.value
    input_box.value = ""

    if query.lower() == 'stop':
        print("Cheers!")
        return

    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))

    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="blue">Chatbot:</font></b> {result["answer"]}'))

print("Welcome! Type 'stop' to quit.")

input_box = widgets.Text(placeholder='Enter your question:')
input_box.on_submit(on_submit)

display(input_box)

Here is the output I see for English and then German:

score 0 · Answer 1 · answered Sep 03 '23 at 07:11

Grüezi! I'm facing a very similar challenge. AFAIK, since the source documents are in German, the vector embeddings calculated for those texts are based on German semantics. If a question is typed in English, the embedding calculated for that question can be quite different, so the similarity search may not retrieve any relevant matches.
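One way to check this hypothesis on your own data is to run the same question in both languages and look at what the vector store returns. Below is a minimal sketch; `compare_retrieval` is a hypothetical helper name of mine, and it only assumes the store exposes `similarity_search_with_score(query, k=...)`, as LangChain vector stores such as the `db` in the question do. With the FAISS wrapper the score is, as far as I know, a distance by default, so lower should mean closer.

```python
from typing import Any, Dict, List, Tuple


def compare_retrieval(
    store: Any, queries: Dict[str, str], k: int = 4
) -> Dict[str, List[Tuple[str, float]]]:
    """Run each labelled query against the store and collect (snippet, score) pairs.

    `store` is any object with a `similarity_search_with_score(query, k=...)`
    method that returns (document, score) tuples, where each document has a
    `page_content` attribute.
    """
    report: Dict[str, List[Tuple[str, float]]] = {}
    for label, query in queries.items():
        hits = store.similarity_search_with_score(query, k=k)
        # Keep only a short snippet so the report stays readable.
        report[label] = [(doc.page_content[:60], float(score)) for doc, score in hits]
    return report
```

You would call it as `compare_retrieval(db, {"de": "Was ist die Unterhaltspflicht des Mieters?", "en": "What is the tenant's maintenance obligation?"})` and compare the two lists. If the English query returns different chunks or clearly worse scores, the language mismatch is a plausible cause.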

My suggestion (or at least how I am attempting to work with two languages):

1. In the metadata dictionary of each chunk's Document object, include an additional key, e.g. metadata = {'language': 'DE'}, and use SelfQueryRetriever (see the LangChain documentation). You must describe the metadata to the AI and instruct it to translate any question into German, then use the translated question to retrieve the relevant chunks carrying the 'language': 'DE' metadata. NOTE: you'll have to investigate whether SelfQueryRetriever works with FAISS.
2. Prompt the AI to generate its final response in English, based on the retrieved German chunks.
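The two steps above can be sketched without SelfQueryRetriever as a plain translate, retrieve, answer pipeline. This is only a rough illustration under my own assumptions: `build_prompt` and `answer_cross_language` are hypothetical names, and the three callables are placeholders you would have to wire to your own translator, to `db.similarity_search`, and to your LLM.

```python
from typing import Callable, Sequence


def build_prompt(question: str, chunks: Sequence[str], answer_language: str = "English") -> str:
    """Combine the retrieved German chunks and the original question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        f"Answer the question in {answer_language}, using only the German context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer_cross_language(
    question: str,
    translate_to_german: Callable[[str], str],
    retrieve_german: Callable[[str], Sequence[str]],
    ask_llm: Callable[[str], str],
    answer_language: str = "English",
) -> str:
    """Translate the question, retrieve German chunks, then answer in the user's language."""
    german_query = translate_to_german(question)  # step 1: query in the documents' language
    chunks = retrieve_german(german_query)        # similarity search on the German query
    prompt = build_prompt(question, chunks, answer_language)
    return ask_llm(prompt)                        # step 2: respond in the user's language
```

For the retriever, something like `lambda q: [d.page_content for d in db.similarity_search(q)]` should fit the `db` from the question. The "using only the German context" instruction is my own wording; drop it if you want the model to fall back on its pre-trained knowledge.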
