ok... it is a vast subject. Actually elasticsearch has been done exactly for that. But google too. from elasticSearch to google search there is a technologic gap.
If you go for your personnal implementation it is still possible... but there is a ton of work to do, to be as efficient as elasticsearch. The quick answer is : use elasticsearch.
You may be curious or for some reason you may need to write it by yourself. So how it works :
TFIDF and cosine distance
As you specified first you will tokenize
then you will represent the tokenized text as vector, then measure the angular distance between the text and the search word.
imagine have only 3 word in your language "foo, bar, bird"
so a text with "foo bar bird" can be represented by a vector3[1,1,1]
a text with
A) "foo foo foo foo bird" will be [4,0,1]
another
B) "foo bar" [1,1,0]
if you search for "bar" which is represented by [0,1,0] you will look for the text with which have the minimal angular distance and if you compute the angular distance between your search and B I think this is 90° which is lower than A.
Actually the language is more than 3 word so you will compute the distance in a vector of many more dimensions because a 1 world = 1 dimension :)
TFIDF stand for term frequency inverse document frequency.
this rates the frequency of a word in a document by the inverse of the frequencies of this word in all document. What it does is it point the important word in a document.
Let explain that :
- the word "that, in a the" etc are everywhere so they are not important
- in all of your texts the word "corpus" has a frequency of I don't know 0.0001%
- in a particular text it is cited 5 times hand has a frequencie of 0.1%
- then it is quite rare in the corpus but quite important in your text in comparison
- so when you'll search for "corpus" what you want is to get first the text where it appears 4 times.
- so instead of vector of number of occurence you will have a vector of relative occurence frequency for exemple [0.2,1,0.0005]
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
I have got a little surprise for you at the end : this is what is behind
the scoring in elasticsearch https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
in addition elasticsearch will provide to you replication scalability, distribution and anything you could dream about.
There is still reason to not use elasticsearch :
- research purpose
- you actually write another search engine like duckduck go (why not)
- you are curious
the scaling and distribution part
Read about lucene before https://en.wikipedia.org/wiki/Apache_Lucene
it depend greatly on the volume of text to be indexed and word.
If it represent 1M of text you dont need to distribut it.
you will need a distributed inverted index if you index something big like wikipedia (wikipedia uses elasticsearch for the searchbox)
foo is in text A,B,C,R
so I will partition my index
I will use a distributed cache with word for the keys and a list of pointer to the vectors as values. I would store the values in memory mapped file.
A search engine is something that must be fast, so if you do the thing by yourself you will reduce the external libraries. You will use c++
At google they end in a situation where the vectors take so much space that they need to store them in multiple machine so they invented GFS this is a distributed file system.
the cosine distance computation between multidimensionnal vectors are time consuming, so I will go for a computation in GPU, because GPU are efficient for floating point operations on matrix and vectors.
Actually this is a little bit crasy to reimplement all of that expect you have a good reason to do so for exemple a very good business model :)
I will probably use kubernetes docker and mesos to virtualise all my component. I will look for something similar to GFS if high volume is needed.
https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems
You will need to get the text back, so I will use any NIO webserver that scale in any language. I will use nginx to serve the static pages and something like netty or vertx to get the search, then build the answer of link to the text (it depend on how many user you want to serve in one second)
All of that if I plan to index something bigger than wikipedia. And if I plan to invent something better than elasticsearch (hard task good luck)
for exemple wikipedia this is less than 1T of text.
finally
If you do it with elasticsearch in one week you are done may be 2 week and in production. If you do it by yourself, you will need at least 1 senior developer and a datascientist an architect, and something like one year or more depending on the volume of text you want to index. I cant help to stop asking myself "what is it for".
Actually if you read the code source of lucene you will know exactly what you need. They done it, lucene is the engine of elasticsearch.
Twitter is using Lucene for its real time search