10

Does anyone know how text editors and programmers' editors are able to do such fast searches on very large text files?

Are they indexing on load, indexing at the start of the find, or using some other clever technique?

I desperately need a faster implementation than what I have, which is a painfully slow walk from the top to the bottom of the text.

Any ideas are really appreciated.

This is for a C# implementation, but it's the technique I'm interested in more than the actual code.

Josh Lee
Andy
  • What is large in your case? Several gigs? – Torsten Marek Feb 10 '09 at 09:49
  • Also, will you have to search multi-lingual text? C# has built-in Unicode support, but if you want to get fancy with search algorithms this may affect your performance. – Elijah Feb 10 '09 at 12:30

4 Answers

6

Begin with the Boyer-Moore search algorithm. It requires some preprocessing of the pattern (which is fast) and searches very well, especially when searching for long substrings.
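To make the preprocessing step concrete, here is a minimal sketch of the Boyer-Moore-Horspool variant, a simplification of Boyer-Moore that keeps only a single shift table. It is written in Python for brevity, on the assumption that the technique matters more than the language, and the name `horspool_find` is made up for this illustration.

```python
def horspool_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.

    Boyer-Moore-Horspool: a simplified Boyer-Moore that uses only one
    shift table, keyed on the text character under the window's last slot.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Preprocessing: for every pattern character except the last, record
    # how far the window may jump when that character sits under the
    # window's last position. Later (rightmost) occurrences overwrite
    # earlier ones, which yields the smallest, safe shift.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = 0
    while pos <= n - m:
        # A slice comparison keeps the sketch short; a tuned version
        # would compare character by character from the right.
        if text[pos:pos + m] == pattern:
            return pos
        # Characters that never occur in the pattern allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

The longer the pattern, the larger the typical jump, which is why this family of algorithms does best on long substrings.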

Anton Gogolev
1

I wouldn't be surprised if most just use the basic, naive search technique: scan for a match on the first character, then test whether the rest of the pattern matches at that position.
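A rough sketch of that naive approach, in Python for brevity (the function name `naive_find` is made up here, and this is not taken from any particular editor):

```python
def naive_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    first = pattern[0]

    # Jump to the next occurrence of the first character, then check
    # whether the whole pattern matches starting at that position.
    i = text.find(first)
    while i != -1 and i + m <= len(text):
        if text.startswith(pattern, i):
            return i
        i = text.find(first, i + 1)
    return -1
```

The single-character scan is the hot loop, and it is usually a highly optimized library routine, which goes some way towards explaining why the naive approach is often fast enough in practice.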

Michael Burr
1

grep

Although not a text editor itself, it is often called by text editors. Have you looked at grep's source code? It has always seemed blazingly fast to me, even when searching large files.

Elijah
  • "I'm going to beat grep by thirty percent!" http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/ – Josh Lee Feb 10 '09 at 14:22
0

One method I know of that has not yet been mentioned is Knuth-Morris-Pratt (KMP) search. It isn't so good for natural-language text, because its advantage comes from reusing repeated prefixes within the pattern, and those are rare in ordinary words. For things like DNA matching, though, it is very good.
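As an illustration of that prefix property, here is a minimal KMP sketch in Python (the name `kmp_find` is made up for this example). The `failure` table records, for each position in the pattern, how much of the pattern's own prefix is still matched after a mismatch, so the text is never re-read.

```python
def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0

    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    # Scan the text once; on a mismatch fall back within the pattern
    # instead of moving backwards in the text.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            return i - m + 1
    return -1
```

For a pattern such as "editor" the table is all zeros, so KMP degenerates to a plain scan. For a pattern such as "ACACAGT" the table lets the search skip work after partial matches.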

Another one is a hash search (I don't know if there is an official name). First, you calculate a hash value of your pattern. Then you slide a window the size of your pattern over the text and check whether the window's hash matches. The idea is to choose the hash so that you don't have to recompute it for the whole window: you update it with just the next character, while the oldest character drops out of the computation. This algorithm performs very well when you have multiple strings to search for, because you compute the hashes for all of them beforehand.
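The sliding-window hash described above is commonly known as the Rabin-Karp algorithm. Below is a minimal Python sketch of it for several patterns at once. Some assumptions are baked in for simplicity: all patterns share one length, the `BASE` and `MOD` constants are arbitrary choices, and the name `rabin_karp_find` is made up for this example.

```python
def rabin_karp_find(text, patterns):
    """Return (index, pattern) for the earliest match of any pattern, or None.

    Assumption for this sketch: all patterns are non-empty and equally long.
    """
    patterns = list(patterns)
    if not patterns:
        return None
    m = len(patterns[0])
    if m == 0 or any(len(p) != m for p in patterns):
        raise ValueError("patterns must be non-empty and of equal length")
    n = len(text)
    if m > n:
        return None

    BASE, MOD = 257, (1 << 61) - 1  # arbitrary constants for the sketch

    def full_hash(s):
        h = 0
        for ch in s:
            h = (h * BASE + ord(ch)) % MOD
        return h

    # Hash every pattern once, up front.
    wanted = {}
    for p in patterns:
        wanted.setdefault(full_hash(p), set()).add(p)

    high = pow(BASE, m - 1, MOD)  # weight of the character leaving the window
    h = full_hash(text[:m])
    for pos in range(n - m + 1):
        if h in wanted:
            window = text[pos:pos + m]
            if window in wanted[h]:  # confirm, since hashes can collide
                return pos, window
        if pos + m < n:
            # Rolling update: drop text[pos], append text[pos + m].
            h = ((h - ord(text[pos]) * high) * BASE + ord(text[pos + m])) % MOD
    return None
```

Each step costs constant time regardless of how many patterns are being searched for, because a single dictionary lookup covers them all.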

Yashwardhan Pauranik
flolo