
I recently built an in-memory index of approximately 2,000,000 documents. The documents are imported from a MySQL database, which takes about 6 to 10 seconds. Every time I start the program, this time is spent importing the data. I have tried using json, pickle, cPickle and even redis, but the load time remains a concern, and for every update I have to restart the whole program. I am using Python here.

My question is: how do search engines like Google, Solr and Elasticsearch store inverted indexes? Do they store them in memory as hash-tables or in a database? How is the index updated without a restart? What would be the best database for this purpose?

  • *The documents are imported from a MySQL database.* Then why do you build an index in memory, when you could use a MySQL index directly? – Serge Ballesta Mar 12 '20 at 12:26
  • @Serge I am importing a test dataset from MySQL. My real goal here is to build search functionality over scraped data. –  Mar 12 '20 at 14:14

1 Answer


Short Answer:

You don't need to load everything into memory, because doing so can be particularly slow for large document collections (worse, the inverted index may not even fit in memory).

Long Answer:

The inverted index is typically stored on disk and loaded dynamically depending on the query... e.g. if the query is "stack overflow", you hit the individual postings lists for the terms 'stack' and 'overflow'...

The on-disk file structure for an inverted index mixes fixed-length and variable-length components: the variable-length parts (the postings lists) are reached through fixed-length pointers.

Since terms (essentially strings) are of variable length, they are mapped to integer ids of fixed length (4/8 bytes). This mapping is usually kept in memory as a hash-table (the number of distinct terms is usually not that large, on the order of 100K, which easily fits in memory).
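For illustration, here is a minimal Python sketch of such a term-to-id mapping (the names `term_to_id` and `get_or_assign_id` are invented for the example, not taken from any particular engine):

```python
# Hypothetical sketch: map variable-length terms to fixed-width integer ids.
term_to_id = {}

def get_or_assign_id(term):
    """Return the integer id for a term, assigning a fresh id on first sight."""
    if term not in term_to_id:
        term_to_id[term] = len(term_to_id)
    return term_to_id[term]

for doc in ("stack overflow", "memory overflow"):
    for token in doc.split():
        get_or_assign_id(token)

print(term_to_id)  # {'stack': 0, 'overflow': 1, 'memory': 2}
```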

Given a term, you look it up in the in-memory hash-table to get its id. You then use the id to jump directly (random access with an offset) to its location on disk. That location contains a pointer to the list of documents containing the term (this list is variable-length), which you load into memory.
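A hedged sketch of this layout, assuming a toy binary format (a 4-byte count followed by that many 4-byte doc ids per term; the in-memory `offsets` table here is a hypothetical stand-in for the fixed-length pointer that locates each list on disk):

```python
import struct

# Toy binary postings file: per term id, a length prefix then the doc ids.
postings = {0: [1, 4, 7], 1: [1, 2, 4], 2: [2]}  # term id -> sorted doc ids

offsets = {}
with open("index.bin", "wb") as f:
    for term_id, doc_ids in postings.items():
        offsets[term_id] = f.tell()               # where this list starts
        f.write(struct.pack("<I", len(doc_ids)))  # fixed-width length prefix
        f.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))

def load_postings(term_id):
    """Random-access read of one postings list; only this list enters memory."""
    with open("index.bin", "rb") as f:
        f.seek(offsets[term_id])                  # jump straight to the list
        (count,) = struct.unpack("<I", f.read(4))
        return list(struct.unpack(f"<{count}I", f.read(4 * count)))

print(load_postings(1))  # [1, 2, 4]
```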

Once you have loaded the postings for all query terms (usually not a large number), you can aggregate scores for documents by walking through these lists (usually the lists are sorted by document id).
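A toy aggregation sketch along these lines, using a unit score per posting as a placeholder for a real weighting scheme such as tf-idf or BM25:

```python
from collections import Counter

# Walk the doc-id-sorted postings of each query term and sum a score per hit.
query_postings = {
    "stack":    [1, 4, 7],   # doc ids containing 'stack'
    "overflow": [1, 2, 4],   # doc ids containing 'overflow'
}

scores = Counter()
for term, doc_ids in query_postings.items():
    for doc_id in doc_ids:
        scores[doc_id] += 1  # placeholder for a per-term weight

# Documents matching more query terms rank first: docs 1 and 4 score 2 here.
for doc_id, score in scores.most_common():
    print(doc_id, score)
```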

A schematic diagram of the above description: [schematic diagram: in-memory term-to-id hash-table pointing to variable-length postings lists on disk]

Debasis
  • What if the term list is much larger than 100K, e.g. phrases that are combinations of 2-7 words? It wouldn't fit in memory. How do I store them in a file then? – TomSawyer Apr 09 '20 at 09:24
  • I also read about this in the Stanford NLP material. Thanks. I think it can be implemented using Python's mmap function. Can you point me to some open-source code for this? It would be very helpful. –  May 04 '20 at 13:57
  • Phrases (or, in general, word n-grams) can be handled by storing the positions of terms in the index... you need not store the higher-order n-grams themselves... e.g. if you want to search the phrase 'New York', you hit the postings for 'New' and 'York' and keep only those documents where the matching positions (within the same document) differ by 1... This is called positional indexing (https://stackoverflow.com/questions/6178083/position-offset-for-phrase-queries-in-lucene); see the sketch after these comments. – Debasis May 04 '20 at 16:08
  • Actually, I implemented a positional index in my system using Python, where the position of the next word must be greater than that of the previous one, and words that appear closer together get a higher ranking. It is a nice technique. –  May 05 '20 at 08:12
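To make the positional-indexing idea from the comments concrete, here is a small hypothetical sketch (the `positional_index` layout is invented for the example): each posting stores a term's positions per document, and a two-word phrase matches only where the positions differ by exactly 1.

```python
# Each term maps to {doc_id: sorted positions}; layout invented for the demo.
positional_index = {
    "new":  {1: [0, 5], 2: [3]},
    "york": {1: [1],    2: [7]},
}

def phrase_match(first, second):
    """Doc ids where `second` occurs at the position right after `first`."""
    hits = []
    for doc_id, first_positions in positional_index[first].items():
        second_positions = set(positional_index[second].get(doc_id, []))
        if any(p + 1 in second_positions for p in first_positions):
            hits.append(doc_id)
    return hits

print(phrase_match("new", "york"))  # [1] (doc 2 has 'new'@3 but 'york'@7)
```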