10

I'm developing something quite similar to an IDE that will handle tens of thousands of very large (text) files and I'm surveying what the state of the art in the subject is.

As an example, IntelliJ's searching algorithm for standard (non-regex) expressions is pretty much immediate. How do they accomplish this? Are they keeping some sort of suffix tree of all the searchable files in memory? Or are they keeping a good portion of the files' contents in memory, so they can run a standard KMP search almost fully in-memory and avoid any disk IO?

Thanks

devoured elysium

3 Answers

14

Currently, IntelliJ IDEA indexes files in the project, and remembers which 3-grams (sequences of 3 letters or digits) occur in which files. When searching, it splits the query into 3-grams as well, gets the files from the index that contain all those trigrams, intersects those sets and uses a relatively straightforward text search in each of those files to check if they really contain the whole search string.
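A minimal sketch of that scheme, assuming naive character trigrams (the real index keeps only letter/digit trigrams) and illustrative names like `TrigramIndex` (not any actual IntelliJ API):

```java
import java.util.*;

// Hypothetical sketch of a trigram index: map each trigram to the set of
// files containing it, intersect the posting lists for the query's trigrams,
// then verify candidates with a plain substring check.
class TrigramIndex {
    private final Map<String, Set<String>> index = new HashMap<>();
    private final Map<String, String> contents = new HashMap<>();

    void add(String fileName, String text) {
        contents.put(fileName, text);
        for (String gram : trigrams(text)) {
            index.computeIfAbsent(gram, k -> new HashSet<>()).add(fileName);
        }
    }

    List<String> search(String query) {
        Set<String> candidates = null;
        for (String gram : trigrams(query)) {
            Set<String> files = index.getOrDefault(gram, Collections.emptySet());
            if (candidates == null) candidates = new HashSet<>(files);
            else candidates.retainAll(files); // intersect posting lists
        }
        if (candidates == null) candidates = contents.keySet(); // query < 3 chars
        List<String> hits = new ArrayList<>();
        for (String f : candidates) {
            // The index only narrows the search; confirm the full match here.
            if (contents.get(f).contains(query)) hits.add(f);
        }
        return hits;
    }

    private static List<String> trigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) grams.add(s.substring(i, i + 3));
        return grams;
    }
}
```

The intersection step is what makes this fast: the final substring scan only touches files that already contain every trigram of the query.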

Peter Gromov
1

You could take a look at Apache Lucene. It's a text search engine library written entirely in Java. It may be a bit too heavyweight for your use case, but since it's open source, you can look at how it works.

It features a demo which walks you through building an index and searching the library's own source code, which sounds pretty much exactly like what you want to do.

Also, take a look at the Boyer-Moore string search algorithm. This is apparently commonly used in applications which offer a ctrl+f style document search. It involves pre-processing the search term so it can run as few comparisons as possible.
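For illustration, here is a sketch of the simplified Horspool variant of Boyer-Moore (not full Boyer-Moore, which also uses a good-suffix rule): the pattern is preprocessed into a bad-character shift table, so on a mismatch the window can skip ahead by up to the pattern length.

```java
// Boyer-Moore-Horspool: compare the window right-to-left, and on a mismatch
// shift by the precomputed distance for the character under the window's end.
class Horspool {
    static int indexOf(String text, String pattern) {
        int m = pattern.length(), n = text.length();
        if (m == 0) return 0;
        int[] shift = new int[256];
        java.util.Arrays.fill(shift, m); // default: skip the whole pattern
        for (int i = 0; i < m - 1; i++) {
            shift[pattern.charAt(i) & 0xFF] = m - 1 - i;
        }
        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) j--;
            if (j < 0) return pos; // full match
            pos += shift[text.charAt(pos + m - 1) & 0xFF];
        }
        return -1;
    }
}
```

On long patterns over large alphabets the average shift is close to the pattern length, which is why it tends to beat left-to-right scanners like KMP for interactive ctrl+f searches.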

js441
  • Hello. I know about Boyer-Moore, but I'm under the impression KMP tends to perform better. I'll have to double-check that, though. – devoured elysium Sep 04 '16 at 22:27
  • Hi, I believe BM can have better than linear runtime when optimized and in certain situations. KMP runtime is always linear. Which one is better will depend on your text/search term lengths. I guess to decide which is better you'd have to determine an average use case and do the calculations. – js441 Sep 04 '16 at 22:37
  • I didn't -1 this post. – devoured elysium Sep 05 '16 at 06:07
1

As js441 pointed out, Apache Lucene is a good option, but only if you are going to do term-based search, similar to how Google works. If you need to search for arbitrary strings that span term boundaries, Lucene will not help you.

In the latter case you are right: you have to build some sort of suffix tree. A neat trick you can use after building the suffix tree is to write it to a file and mmap it into your memory space. That way you do not waste RAM keeping the entire tree in memory, yet frequently accessed portions of the tree are automatically cached. The drawback of mmap is that initial searches might be somewhat slow. Also, this will not work well if your files change often.
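As a toy stand-in for the suffix tree, here is a sketch using a suffix *array* (a flatter structure supporting the same "find every occurrence of any substring" query); a production version would build it with a linear-time algorithm and could be serialized to disk and mmap'ed as described above. All names here are illustrative.

```java
import java.util.*;

// Suffix array sketch: sort all suffix start positions lexicographically,
// then answer substring queries with a binary search over the suffixes.
class SuffixArray {
    private final String text;
    private final Integer[] sa;

    SuffixArray(String text) {
        this.text = text;
        sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // O(n^2 log n) comparison sort; fine for a sketch, not for huge files.
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    // All start positions where pattern occurs, in ascending order.
    List<Integer> occurrences(String pattern) {
        int lo = 0, hi = sa.length;
        while (lo < hi) { // first suffix >= pattern
            int mid = (lo + hi) / 2;
            if (suffix(mid).compareTo(pattern) < 0) lo = mid + 1;
            else hi = mid;
        }
        List<Integer> result = new ArrayList<>();
        for (int i = lo; i < sa.length && suffix(i).startsWith(pattern); i++) {
            result.add(sa[i]);
        }
        Collections.sort(result);
        return result;
    }

    private String suffix(int i) { return text.substring(sa[i]); }
}
```

Note that the query can start and end anywhere in the text, with no notion of "terms" at all, which is exactly what Lucene-style inverted indexes cannot give you.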

To handle searches over recently edited files, you can keep two indices: one for the bulk of your files and another just for the recently edited ones. A search then consults both indices. Periodically you rebuild the permanent index with the contents of the new files and replace the old one.
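The two-index scheme can be sketched roughly as follows; for brevity a plain contents map with a linear scan stands in for each real index, and all names are made up:

```java
import java.util.*;

// Hypothetical dual-index: a large, rarely rebuilt "permanent" side plus a
// small side for recently edited files that shadows stale permanent entries.
class DualIndex {
    private final Map<String, String> permanent = new HashMap<>(); // file -> contents
    private final Map<String, String> recent = new HashMap<>();

    void addPermanent(String file, String text) { permanent.put(file, text); }

    // Edits land in the small index; the stale permanent copy is shadowed.
    void edit(String file, String text) { recent.put(file, text); }

    // Periodically fold the recent files back in and start fresh.
    void rebuild() { permanent.putAll(recent); recent.clear(); }

    Set<String> search(String query) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, String> e : permanent.entrySet()) {
            if (!recent.containsKey(e.getKey()) && e.getValue().contains(query)) {
                hits.add(e.getKey());
            }
        }
        for (Map.Entry<String, String> e : recent.entrySet()) {
            if (e.getValue().contains(query)) hits.add(e.getKey());
        }
        return hits;
    }
}
```

The key design point is that edits never touch the expensive permanent structure; they only grow the cheap recent one until the next rebuild.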

Here are some examples of when Lucene is good and when a suffix tree is good:

Assume you have a document that contains the following:

A quick brown dog has jumped over lazy fox.

Lucene is good for the following searches:

  • quick
  • quick brown
  • q*
  • q* b

With some tricks you can make the following searches work well:

  • '*ick *own'

This type of search will run very slowly:

  • 'q*ick brown d*g'

And this type of search will never find anything:

  • "ick brown d"

Lucene is also good when you treat your documents as bags of words, so you can easily do searches like:

  • quick fox

which will find all documents that contain the words quick and fox, no matter what is between them.

On the other hand, a suffix tree works well for finding exact matches of substrings within the document, even when the search spans term boundaries and starts and ends in the middle of a term.

A very good algorithm for constructing suffix trees of large arrays is described here (warning: paywalled).

Vlad