Data structure to index entire document and algorithm for quick search of any size substring

Question

I'm trying to find a data structure (and algorithm) that would allow me to index an entire text document and search for any substring of it, no matter the size of the substring. The data structure should be stored on disk, during or at the end of the indexing procedure.

For instance, given the following sentence:

The book is on the table

The algorithm should quickly (O(log(n))) find the occurrences of any substring of the text.

For instance, if the input is book it should find all occurrences of it, but this should also be true for book is and The book is.

Unfortunately, the majority of solutions work by tokenizing the text and searching on individual tokens. Ordinary databases also index text without supporting substring search (is that why a LIKE '%foo%' query is done with a linear scan and takes so long?).

I could try to develop something from scratch (maybe a variation of an inverted index?) but I'd love to discover that somebody has already done it.

The most similar thing I found is SQLite3 Full-text search.

Thanks!

score 4 · Answer 1 · answered Jan 21 '16 at 20:09

One approach is to index your document in a suffix tree. Every substring of the document is a prefix of some suffix, so the tree covers all of them.

With this approach, you build the suffix tree once. To query a substring s, follow the nodes of the tree from the root, one character of s at a time. If you can follow the entire query string, then some suffix has the query string as a prefix, and so the query is a substring of the document.
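The lookup described here can be sketched in a few lines. This is a minimal illustration only: it builds a naive, uncompressed suffix trie (quadratic in the text length), not a real compressed suffix tree built in linear time, and the names `Node`, `build_suffix_trie` and `find` are made up for this example.

```python
class Node:
    """One trie node: outgoing edges plus the start offsets of suffixes passing through it."""
    __slots__ = ("children", "starts")

    def __init__(self):
        self.children = {}
        self.starts = []


def build_suffix_trie(text):
    # Naive construction (assumption for illustration): insert every suffix text[i:].
    # This costs O(n^2) time and space; a real suffix tree compresses the paths.
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(i)  # the suffix starting at i passes through this node
    return root


def find(root, query):
    # Follow the query character by character; failing to follow an edge
    # means the query is not a substring of the indexed text.
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.starts  # start offsets of every occurrence of the query


text = "The book is on the table"
trie = build_suffix_trie(text)
print(find(trie, "book"))         # [4]
print(find(trie, "The book is"))  # [0]
print(find(trie, "he"))           # [1, 16]
```

The query cost depends on the length of the query rather than on the length of the document, which is what makes this family of structures attractive for substring search. Matching here is case-sensitive.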

If you are querying only complete words, an inverted index could be enough. An inverted index usually maps a term (word) to the list of documents it appears in. In your case it would instead map each term to its positions in the document.

Upon query, for each occurrence of term i of the query at some position p, you check whether term i+1 of the query appears at position p+1.

This can be done pretty efficiently, similarly to how an inverted index traditionally handles AND queries, but instead of searching for all terms in the same document, you search for terms at consecutive positions.
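A minimal sketch of that positional lookup, assuming whitespace tokenization and case-sensitive terms (both simplifications), with the hypothetical helper names `build_positional_index` and `find_phrase`:

```python
from collections import defaultdict


def build_positional_index(text):
    # Map each word to the list of word positions where it occurs.
    index = defaultdict(list)
    for pos, word in enumerate(text.split()):
        index[word].append(pos)
    return index


def find_phrase(index, phrase):
    words = phrase.split()
    if not words:
        return []
    # Start from every position of the first term, then keep only the
    # positions p where term k of the query occurs at position p + k.
    candidates = index.get(words[0], [])
    for offset, word in enumerate(words[1:], start=1):
        positions = set(index.get(word, []))
        candidates = [p for p in candidates if p + offset in positions]
    return candidates  # word positions where the whole phrase starts


index = build_positional_index("The book is on the table")
print(find_phrase(index, "book is"))      # [1]
print(find_phrase(index, "The book is"))  # [0]
print(find_phrase(index, "is book"))      # []
```

The results are word positions, not character offsets. Storing the character offset alongside each word position would be a small extension if offsets into the original text are needed.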

Thanks! This is very similar to what I was looking for! How would I store it on disk? Are there any variations of it? Why not an ordinary prefix tree? - Silas, Jan 21 '16 at 20:17
