I'm looking for help with retrieving data from documents embedded in a vector store. I'm still pretty new to this, so I may be missing something obvious.
The issue I'm facing is that some specific data from the documents doesn't seem to be found when using FAISS.similarity_search() from langchain. I've also tried max_marginal_relevance_search() and similarity_search_with_score(), with no better results.
I've built an 8,500-movie dataset in JSON, which I load with a custom JSONLoader and then split, before embedding the documents into a FAISS vector store. For the embedding model I've tried all-mpnet-base-v2, all-MiniLM-L12-v2, instructor-large, and instructor-xl. All of them give the same results, which leads me to think the issue lies elsewhere.
Here is how each embedded document is built:
text = ""
text += f"Title: {movie['title']}\n"
text += f"Original title: {movie['original_title']}\n"
text += f"Release Date: {movie['release_date']}\n"
text += f"Genres: {movie['genres']}\n"
text += f"Nationality: {movie['original_language']}\n"
text += f"Score: {movie['vote_average']}/10\n"
text += f"Casting: {movie['actors']}\n"
text += f"Directors and writers: {movie['directors']}\n"
text += f"Overview: {movie['overview']}\n"
text += getReviews(movie)
metadata = dict(
    source=f"{self.file_path}-{movie['title']}",
    id=movie['id'],
    title=movie['title'],
)
text += "\n\n"
return Document(page_content=text, metadata=metadata)
My problem is that when I query a person's name, nothing is found unless the name appears in the reviews. For example, if I ask for "A movie directed by Louis Leterrier", it won't find Fast X, even though it is stored in the DB. But if I ask for "A movie with Chris Pine", lots of movies with him appear, since his name is also written in some reviews.
Even if I just query "Tyler Posey", whose name appears only once in the whole dataset (for "Teen Wolf: The Movie"), it won't give me that result, even though the query text is identical to what's in the document. Instead it retrieves some completely random movies with no obvious match at first sight.
Things I've tried so far:
- building the documents from txt instead of JSON and loading them with UnstructuredFileLoader;
- replacing the list of actors with a more meaningful sentence;
- removing the reviews from the documents to reduce the noise;
- different chunk sizes from 400 to 1200, with about 20% overlap, using RecursiveCharacterTextSplitter.
I'm starting to run out of ideas, and any help would be welcome.
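For what it's worth, my understanding is that similarity_search ranks documents purely by vector similarity between the query embedding and the document embeddings, so I've been trying to score a document against a query by hand to see what the model "thinks". A minimal numpy sketch of that scoring, with made-up stand-in vectors (in practice they would come from embed_model.embed_query() and embed_model.embed_documents()):

```python
import numpy as np

# Hypothetical stand-in vectors, NOT real embeddings.
query_vec = np.array([0.1, 0.9, 0.2])       # e.g. the query "Tyler Posey"
doc_vecs = np.array([
    [0.1, 0.8, 0.3],                        # the document that contains the name
    [0.9, 0.1, 0.0],                        # an unrelated document
])

def cosine(a, b):
    # Cosine similarity: dot product of the normalized vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query_vec, d) for d in doc_vecs]
best = int(np.argmax(scores))               # index of the closest document
```

If the document containing the name doesn't come out on top even in this manual comparison, the problem would be the embeddings themselves rather than the store.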
Edit: here is some code detailing the whole process, and here is a link to a small part of the dataset: https://jsonblob.com/1128451472412131328
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores.faiss import FAISS
from langchain.docstore.document import Document
import json
movies = []  # populated from the JSON dataset (loading code omitted)

def load():
    docs = []
    for movie in movies:
        text = ""
        text += f"Title: {movie['title']}\n"
        text += f"Original title: {movie['original_title']}\n"
        text += f"Release Date: {movie['release_date']}\n"
        text += f"Genres: {movie['genres']}\n"
        text += f"Nationality: {movie['original_language']}\n"
        text += f"Score: {movie['vote_average']}/10\n"
        text += f"Casting: {movie['actors']}\n"
        text += f"Directors and writers: {movie['directors']}\n"
        metadata = dict(
            source=movie['id'],
            title=movie['title'],
        )
        doc = Document(page_content=text, metadata=metadata)
        docs.append(doc)
    return docs
data = load()
# Splitting into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
data = text_splitter.split_documents(data)
# Embedding into a FAISS vector store
model_id = "hkunlp/instructor-large"
embed_model = HuggingFaceInstructEmbeddings(
    model_name=model_id,
    model_kwargs={"device": "cuda"},
)
vectorstore = FAISS.from_documents(data, embed_model)
# Fetching results from the store
query = input("Please enter your movie description: ")
docs = vectorstore.similarity_search(query, k=3)
for doc in docs:
    print(doc.metadata['title'])
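As I understand it, the index that FAISS.from_documents builds by default does an exact L2 nearest-neighbour search over the embeddings, so the top-k results are simply the k vectors closest to the query. A toy numpy sketch of that ranking, again with stand-in vectors rather than real embeddings:

```python
import numpy as np

# Stand-in 3-dim "embeddings" for four documents, plus a query vector.
doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0])

# Exact L2 search: distance from the query to every document,
# then keep the k smallest (here k=3).
dists = np.linalg.norm(doc_vecs - query, axis=1)
top_k = np.argsort(dists)[:3]
```

So if the right document never shows up in the top k, it really means its embedding sits further from the query embedding than those "random" movies do, which is what I can't explain.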