I'm looking for help with retrieving data from documents embedded in a vector store. I'm still pretty new to this, so I may be missing something obvious.
The issue I'm facing is that some specific data from the documents doesn't seem to be found when using FAISS.similarity_search() from langchain. I've also tried max_marginal_relevance_search() and similarity_search_with_score(), with no better results.
I've built an 8,500-movie dataset in JSON, which I load with a custom JSONLoader and then split, before embedding the documents into a FAISS vector store. For the embedding model I've tried all-mpnet-base-v2, all-MiniLM-L12-v2, instructor-large, and instructor-xl. All of them give the same results, which leads me to think the issue lies elsewhere.
Here is how each embedded document is built:
text = ""
text += f"Title: {movie['title']}\n"
text += f"Original title: {movie['original_title']}\n"
text += f"Release Date: {movie['release_date']}\n"
text += f"Genres: {movie['genres']}\n"
text += f"Nationality: {movie['original_language']}\n"
text += f"Score: {movie['vote_average']}/10\n"
text += f"Casting: {movie['actors']}\n"
text += f"Directors and writers: {movie['directors']}\n"
text += f"Overview: {movie['overview']}\n"
text += getReviews(movie)
metadata = dict(
    source=f"{self.file_path}-{movie['title']}",
    id=movie['id'],
    title=movie['title'],
)
text += "\n\n"
return Document(page_content=text, metadata=metadata)
My problem is that when I query a person's name, nothing is found unless the name appears in the reviews. For example, if I ask for "A movie directed by Louis Leterrier", it won't find Fast X, even though it is stored in the DB. But if I ask for "A movie with Chris Pine", lots of movies with him appear, since his name is also written in some reviews.
Even if I just query "Tyler Posey", whose name appears only once in the whole dataset (for "Teen Wolf: The Movie"), it won't give me that result, even though the query text is identical to what's in the document. Instead it retrieves some completely random movies with no obvious match at first sight.
Things I've tried so far:
- building the documents from txt instead of JSON and loading them with UnstructuredFileLoader;
- replacing the list of actors with a more meaningful sentence;
- removing the reviews from the documents to reduce the noise;
- different chunk sizes from 400 to 1200, with about 20% overlap, using RecursiveCharacterTextSplitter.
I'm starting to run out of ideas, and any help would be welcome.
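For what it's worth, my understanding is that similarity_search ranks documents purely by vector similarity between the query embedding and the document embeddings, so I've been trying to score a document against a query by hand to see what the model "thinks". A minimal numpy sketch of that scoring, with made-up stand-in vectors (in practice they would come from embed_model.embed_query() and embed_model.embed_documents()):

```python
import numpy as np

# Hypothetical stand-in vectors, NOT real embeddings.
query_vec = np.array([0.1, 0.9, 0.2])       # e.g. the query "Tyler Posey"
doc_vecs = np.array([
    [0.1, 0.8, 0.3],                        # the document that contains the name
    [0.9, 0.1, 0.0],                        # an unrelated document
])

def cosine(a, b):
    # Cosine similarity: dot product of the normalized vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query_vec, d) for d in doc_vecs]
best = int(np.argmax(scores))               # index of the closest document
```

If the document containing the name doesn't come out on top even in this manual comparison, the problem would be the embeddings themselves rather than the store.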
Edit: here is some code detailing the whole process, and here is a link to a small part of the dataset: https://jsonblob.com/1128451472412131328
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores.faiss import FAISS
from langchain.docstore.document import Document
import json
movies = []  # populated from the JSON dataset (loading code omitted)

def load():
    docs = []
    for movie in movies:
        text = ""
        text += f"Title: {movie['title']}\n"
        text += f"Original title: {movie['original_title']}\n"
        text += f"Release Date: {movie['release_date']}\n"
        text += f"Genres: {movie['genres']}\n"
        text += f"Nationality: {movie['original_language']}\n"
        text += f"Score: {movie['vote_average']}/10\n"
        text += f"Casting: {movie['actors']}\n"
        text += f"Directors and writers: {movie['directors']}\n"
        metadata = dict(
            source=movie['id'],
            title=movie['title'],
        )
        doc = Document(page_content=text, metadata=metadata)
        docs.append(doc)
    return docs
data = load()
# Splitting into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
data = text_splitter.split_documents(data)
# Embedding into a FAISS vector store
model_id = "hkunlp/instructor-large"
embed_model = HuggingFaceInstructEmbeddings(
    model_name=model_id,
    model_kwargs={"device": "cuda"},
)
vectorstore = FAISS.from_documents(data, embed_model)
# Fetching results from the store
query = input("Please enter your movie description: ")
docs = vectorstore.similarity_search(query, k=3)
for doc in docs:
    print(doc.metadata['title'])
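As I understand it, the index that FAISS.from_documents builds by default does an exact L2 nearest-neighbour search over the embeddings, so the top-k results are simply the k vectors closest to the query. A toy numpy sketch of that ranking, again with stand-in vectors rather than real embeddings:

```python
import numpy as np

# Stand-in 3-dim "embeddings" for four documents, plus a query vector.
doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

# Exact L2 search: distance from the query to every document,
# then keep the k smallest (here k=3).
dists = np.linalg.norm(doc_vecs - query, axis=1)
top_k = np.argsort(dists)[:3]
```

So if the right document never shows up in the top k, it really means its embedding sits further from the query embedding than those "random" movies do, which is what I can't explain.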