How can we design system for document search?

Question

I was recently asked a system design question where I needed to "design system for document search" and first thing came in my mind was how elastic search works. So I came up with inverted index approach that is used to support text search. The inverted index has a record for each term. Each record has the list of documents that the term appears in. Documents are identified by an integer document ID. The list of document IDs is sorted in ascending order.

So I said something along below line but I am not sure whether this is the way it should work in distributed fashion because we may have lot of documents to be indexed so we need some load balancer and somehow we need to partition inverted index or data. Meaning a process that will upload the documents and then what process will tokenize it (will it be just one machine or bunch of machines). Basically wanted to understand what's the right way to design a system like this with proper components. How should we talk about this in system design interview? What are the things we should touch in an interview for this problem?

What's the right way we should design system for document search in distributed fashion with right components.

Damien MIRAS · Accepted Answer · 2019-05-26T15:36:45.707

ok... it is a vast subject. Actually elasticsearch has been done exactly for that. But google too. from elasticSearch to google search there is a technologic gap.

If you go for your personnal implementation it is still possible... but there is a ton of work to do, to be as efficient as elasticsearch. The quick answer is : use elasticsearch.

You may be curious or for some reason you may need to write it by yourself. So how it works :

TFIDF and cosine distance

As you specified first you will tokenize then you will represent the tokenized text as vector, then measure the angular distance between the text and the search word.

imagine have only 3 word in your language "foo, bar, bird" so a text with "foo bar bird" can be represented by a vector3[1,1,1] a text with

A) "foo foo foo foo bird" will be [4,0,1] another

B) "foo bar" [1,1,0]

if you search for "bar" which is represented by [0,1,0] you will look for the text with which have the minimal angular distance and if you compute the angular distance between your search and B I think this is 90° which is lower than A.

Actually the language is more than 3 word so you will compute the distance in a vector of many more dimensions because a 1 world = 1 dimension :)

TFIDF stand for term frequency inverse document frequency. this rates the frequency of a word in a document by the inverse of the frequencies of this word in all document. What it does is it point the important word in a document.

Let explain that :

the word "that, in a the" etc are everywhere so they are not important
in all of your texts the word "corpus" has a frequency of I don't know 0.0001%
in a particular text it is cited 5 times hand has a frequencie of 0.1%
then it is quite rare in the corpus but quite important in your text in comparison
so when you'll search for "corpus" what you want is to get first the text where it appears 4 times.
so instead of vector of number of occurence you will have a vector of relative occurence frequency for exemple [0.2,1,0.0005]

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

I have got a little surprise for you at the end : this is what is behind the scoring in elasticsearch https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

in addition elasticsearch will provide to you replication scalability, distribution and anything you could dream about.

There is still reason to not use elasticsearch :

research purpose
you actually write another search engine like duckduck go (why not)
you are curious

the scaling and distribution part

Read about lucene before https://en.wikipedia.org/wiki/Apache_Lucene

it depend greatly on the volume of text to be indexed and word.

If it represent 1M of text you dont need to distribut it. you will need a distributed inverted index if you index something big like wikipedia (wikipedia uses elasticsearch for the searchbox)

foo is in text A,B,C,R so I will partition my index

I will use a distributed cache with word for the keys and a list of pointer to the vectors as values. I would store the values in memory mapped file. A search engine is something that must be fast, so if you do the thing by yourself you will reduce the external libraries. You will use c++

At google they end in a situation where the vectors take so much space that they need to store them in multiple machine so they invented GFS this is a distributed file system.

the cosine distance computation between multidimensionnal vectors are time consuming, so I will go for a computation in GPU, because GPU are efficient for floating point operations on matrix and vectors.

Actually this is a little bit crasy to reimplement all of that expect you have a good reason to do so for exemple a very good business model :)

I will probably use kubernetes docker and mesos to virtualise all my component. I will look for something similar to GFS if high volume is needed. https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems

You will need to get the text back, so I will use any NIO webserver that scale in any language. I will use nginx to serve the static pages and something like netty or vertx to get the search, then build the answer of link to the text (it depend on how many user you want to serve in one second)

All of that if I plan to index something bigger than wikipedia. And if I plan to invent something better than elasticsearch (hard task good luck)

for exemple wikipedia this is less than 1T of text.

finally

If you do it with elasticsearch in one week you are done may be 2 week and in production. If you do it by yourself, you will need at least 1 senior developer and a datascientist an architect, and something like one year or more depending on the volume of text you want to index. I cant help to stop asking myself "what is it for".

Actually if you read the code source of lucene you will know exactly what you need. They done it, lucene is the engine of elasticsearch.

Twitter is using Lucene for its real time search

Thanks for sharing your thoughts. Other part I wanted to understand was from the system design perspective what are the components that I will have that should do this job in a distributed fashion way? Basically to help interviewer understand what each component will do. — flash, May 25 '19 at 21:31
Thanks. This will help me to give a broad idea how we should design a system like this which does the job and what each component responsibility is basically. Since I know the basics now so a broader component design will give me clear picture. — flash, May 25 '19 at 21:46
if you are a research laboratory...yes you may need all that kind of components, and if I dont know the exact need I can just give you a vague nature of the architecture. — Damien MIRAS, May 25 '19 at 22:12
if you are a business you will need 3 component only one to collect the document and insert them in elasticsearch, another to give them back, and elasticsearch itself. — Damien MIRAS, May 25 '19 at 22:13

How can we design system for document search?

1 Answers1