I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:

grep "cappucino" filename

I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.

Please correct me if I'm wrong:

I use a mechanical hard drive with a read speed of roughly 200 MB/s. 10 TB is about 10 million MB, so it should take roughly 10,000,000 / 200 = 50,000 seconds ≈ 14 hours to finish. Is this an accurate estimate?
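
The arithmetic behind this estimate, as a minimal sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained sequential read at the full 200 MB/s with no CPU stalls; the function name is only illustrative.

```python
TB = 10**12  # decimal terabyte, in bytes (assumption: drive vendors' units)
MB = 10**6   # decimal megabyte, in bytes


def scan_time_seconds(size_bytes, throughput_bytes_per_s):
    """Time to read size_bytes once at a constant sequential throughput."""
    return size_bytes / throughput_bytes_per_s


seconds = scan_time_seconds(10 * TB, 200 * MB)
hours = seconds / 3600
print(f"{seconds:,.0f} s = {hours:.1f} h")  # prints "50,000 s = 13.9 h"
```

This is a pure I/O figure: it says nothing about matching cost, seek time on a fragmented file, or other processes competing for the disk.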

Popcorn
  • And where is the time for the computation (word comparisons, etc.)? – bksi Sep 02 '14 at 02:10
  • I think you should also take the CPU's data processing time into account. – Lawrence Choy Sep 02 '14 at 02:10
  • Is this a regexp search or not? – Mateusz Dymczyk Sep 02 '14 at 02:19
  • @Popcorn: *grep* **is** a regex; the acronym stands for *General Regular Expression Parser*. – Pieter Geerkens Sep 02 '14 at 02:31
  • @Popcorn: Are you truly scanning for occurrences (or absence of same) of regular expressions, or should you be writing a simple hand-rolled parser? **Not every text-string search problem is amenable to efficient use of a regex**. – Pieter Geerkens Sep 02 '14 at 02:33
  • @PieterGeerkens but with -F (or fgrep) it will treat the pattern as a fixed string and that was the question – Mateusz Dymczyk Sep 02 '14 at 02:33
  • Is it a sentence/phrase, a word, or multiple words that you are looking for? – Nick Maroulis Sep 02 '14 at 02:35
  • @MateuszDymczyk: That is your interpretation of the question; I see no definitive statement by OP (in the question) to confirm that. – Pieter Geerkens Sep 02 '14 at 02:35
  • @PieterGeerkens I asked him whether that is a regexp search (which was shorthand for "are you doing -F or not") and he answered, what's the problem? – Mateusz Dymczyk Sep 02 '14 at 02:35
  • I've edited the question with an example, I'm mainly concerned with the estimated time, not whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep. – Popcorn Sep 02 '14 at 03:00
  • @PieterGeerkens wrt the name grep, it's derived from `g/re/p`, which are the `ed` commands to Globally find a Regular Expression and Print the line. It does not stand for General Regular Expression Parser. – Ed Morton Sep 02 '14 at 03:43
  • We can expect this search to be IO bound, so your estimate based on raw speed will be a good approximation. The FSM implementation in grep does not have to match every character of the file; unless the processor is *really* slow it should easily keep up. The complication could come if you want to search for several strings at the same time. That will slow the in-memory progress, but would still be multiple times faster than doing the searches one after the other. – andy256 Sep 02 '14 at 04:40
  • Out of curiosity (and some personal desperation)... do you remember how long it took? – Hashim Aziz Sep 08 '18 at 23:21

1 Answer

The short answer is: no.

The longer answer is: it depends.

The even longer answer is: grep's performance depends on a lot of things:

  • are you running a fixed-string search (-F, fgrep) or not - grep uses the Boyer-Moore algorithm, which by itself cannot match regular expressions. So for a regexp, what grep does (or at least used to do) is first extract a fixed string from the regexp, search for it in the text with BM, and only then run a full regexp match on the candidate (I'm not sure whether the current implementation uses an NFA or a DFA, probably a hybrid)
  • how long is your pattern - BM works faster for longer patterns, because it can skip further ahead after a mismatch
  • how many matches you will have - the fewer the matches, the faster it will be
  • what your CPU and memory are - a fast hard drive only helps with reading the data, not with the computation
  • what other options you are using with grep
  • 14 hours might not even be your lower bound, because Boyer-Moore is smart enough to compute the offset at which the next possible match might occur, so it doesn't need to examine every byte of the file. Whether that also saves any disk reads depends on the implementation and is just my speculation. After re-running the test below with a much longer pattern I was able to get down to 0.23 sec, and I don't think my disk is that fast. But that result may simply be caching at work, with the file served from memory instead of from disk.

For instance, I'm running on a 500 MB/s SSD (at least that's what the manufacturer claims), and grepping a 200 MB file with a very short pattern (a few characters) gives me:

With 808320 hits:

real    0m1.734s
user    0m1.334s
sys     0m0.120s

With 0 hits:

real    0m0.059s
user    0m0.046s
sys     0m0.016s
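
To separate the CPU cost from the disk cost, one rough check is to time a fixed-string scan over data that is already in memory. The sketch below is only a stand-in under stated assumptions: it uses Python's built-in `bytes.count`, not grep's matcher, and the filler text, buffer size, and function name are made up for illustration. It gives an order-of-magnitude feel for how in-memory search speed compares with a 200 MB/s disk, nothing more.

```python
import time


def in_memory_scan_rate(size_mb=64, pattern=b"cappucino"):
    """Return (MB scanned per second, number of hits) for an in-memory scan."""
    chunk = b"lorem ipsum dolor sit amet "  # hypothetical filler text
    data = chunk * (size_mb * 1_000_000 // len(chunk))

    start = time.perf_counter()
    hits = data.count(pattern)  # fixed-string search, no regex involved
    elapsed = time.perf_counter() - start

    return len(data) / 1e6 / elapsed, hits


rate, hits = in_memory_scan_rate()
print(f"scanned at about {rate:.0f} MB/s, {hits} hits")
```

If the reported rate is well above the disk's sequential throughput, the search is likely to be I/O bound on a cold cache.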

@Edit: in short read about Boyer-Moore :-)
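
To illustrate why longer patterns help, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore. It is not grep's actual code; the function name and the comparison counter are my own additions for illustration.

```python
def horspool_search(text: bytes, pattern: bytes):
    """Return (index of first match or -1, number of byte comparisons made)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0, 0

    # Bad-character table: for every byte of the pattern except the last one,
    # the distance from its last occurrence to the end of the pattern.
    shift = {b: m - 1 - i for i, b in enumerate(pattern[:-1])}

    comparisons = 0
    pos = 0
    while pos + m <= n:
        # Compare the window against the pattern from right to left.
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        # Skip ahead based on the byte aligned with the pattern's last position;
        # bytes that do not occur in the pattern allow a full-length skip.
        pos += shift.get(text[pos + m - 1], m)
    return -1, comparisons


haystack = b"x" * 900 + b"cappucino"
index, comparisons = horspool_search(haystack, b"cappucino")
print(index, comparisons, len(haystack))
```

On this input the search makes far fewer comparisons than there are bytes in the text, because most windows are rejected after a single comparison and skipped by the full pattern length. Note that this only reduces work done in memory; the data still has to be read from somewhere.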

@Edit2: to see how grep really works you should check the source code; what I described above is only a very general workflow.

Mateusz Dymczyk