I need to search for something in a huge log file (over 14 GB). I'm pretty sure it's in the last 4 GB or so.

Is there a way to skip the first X GB to speed things up?

Roger
  • `LC_ALL=C grep` may speed it up. – jfs Jan 31 '17 at 14:22
  • You will be able to get a lot of speed by picking a sensible `grep` expression... wildcards of unknown length (like `a.*thing`) will in some cases take much longer to evaluate. It may be that you are optimizing for the wrong thing (although it never hurts to search only part of the file, obviously - it may just not be the greatest source of speedup). – Floris Feb 04 '17 at 04:04
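
(A quick illustration of the two tips above; a sketch assuming GNU grep and a search string that needs no regex features:)

LC_ALL=C grep -F something file   # C locale enables byte-wise matching; -F treats the pattern as a fixed string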

3 Answers

I guess you could use `tail` to only output the last 4 GB or so by using the `-c` switch:

-c, --bytes=[+]NUM
output the last NUM bytes; or use -c +NUM to output starting with byte NUM of each file
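
For the question's "last 4 GB or so", that could look like this (a sketch, assuming GNU tail, which accepts the G suffix for GiB):

tail -c 4G file | grep something   # read only the final 4 GiB and search it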

You could probably do something with dd too, by setting bs=1 and skipping to the byte offset you want to start at, or, much faster, by using a larger block size and skipping the corresponding number of blocks, e.g.

dd if=file bs=1024k skip=12288 | grep something   # skip counts bs-sized blocks: 12288 × 1 MiB = 12 GiB
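
If your dd comes from GNU coreutils 8.16 or newer, iflag=skip_bytes lets it skip an exact byte offset while keeping a large block size, sidestepping the slow bs=1 variant discussed in the comments below (a sketch):

dd if=file bs=1M iflag=skip_bytes skip=12G | grep something   # with skip_bytes, skip is interpreted as bytes, not blocks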
user9517
  • Thanks! You saved me a ton of time! I added the command as I used it to my question. Works like a charm! – Roger Jan 31 '17 at 08:50
  • 83
    Afterwards, you should configure logrotate. – Gerald Schneider Jan 31 '17 at 08:51
  • @GeraldSchneider yeah i did. I thought it was setup but it wasn't. – Roger Jan 31 '17 at 09:28
  • 3
    @Rogier Please add an answer with the solution instead of adding it in your question. This is similar to self-answer: http://serverfault.com/help/self-answer – A.L Jan 31 '17 at 11:16
  • The OP doesn't need to add an answer. They clearly used mine. – user9517 Jan 31 '17 at 11:19
  • @istheEnglishway Maybe you can add my `command` as well (I'll then remove my update)? It's slightly different. I already accepted your answer and didn't want to add another one. – Roger Jan 31 '17 at 11:30
  • 5
    @istheEnglishway: Well, no, they posted a different command. – Lightness Races in Orbit Jan 31 '17 at 13:39
  • 2
    @LightnessRacesinOrbit You need to read my answer, the first comment above and possibly the edit histories before poking your nose in. The OP implemented a solution using `tail -c` which is what my answer suggests. – user9517 Jan 31 '17 at 13:42
  • 11
    But your answer doesn't provide the actual command that implements that solution, which is added value. You could edit that into your answer, or the OP could post it as a new answer. They definitely shouldn't add it to the question, which is what happened. And you definitely shouldn't be throwing around epithets like "poking your nose in". – Lightness Races in Orbit Jan 31 '17 at 13:53
  • It clearly doesn't need to provide the actual command. I would also argue that providing the actual command is less useful to other people who may find this Q&A going forward. Just look at the mess that [tag:mod-rewrite] and [tag:virtualhost] are in. People are entirely unable to take specifics and apply them to their own exactly-the-same-but-different issue, and want personalised hand-holding. – user9517 Jan 31 '17 at 14:02
  • 7
    @istheEnglishway, believe it or not having an example make things easier than having to read a man page (see also : stackoverflow documentation) – Pierre.Sassoulas Jan 31 '17 at 14:43
  • 1
    @Pierre.Sassoulas believe it or not I disagree with you. Re: SO documentation - just another write only medium that no one who needs to will ever bother reading. – user9517 Jan 31 '17 at 14:45
  • 1
    `bs=1` will be incredibly slow. Like 1000x to 10000x slower than normal reads. – R.. GitHub STOP HELPING ICE Feb 02 '17 at 13:18

I'm just posting this because some of the comments asked for it.

What I ended up using (on a 15 GB file) was the command below. It worked very fast and saved me a ton of time.

tail -c 14G file | grep something

I also did a very rudimentary benchmark on the same file. I tested:

grep xxx file
# took forever (> 5 minutes)

dd if=file bs=1 skip=14G | grep xxx
# very fast, < 1 sec

tail -c 14g file | grep xxx
# pretty fast, < 2 sec

The tail command is just a bit shorter to type.

NB: the required suffix (g or G) differs per command (Ubuntu 15.10).
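
(If you want to re-run such a benchmark with a cold cache, as the first comment below suggests, one approach on Linux is to drop the page cache before each run; a sketch, requires root:)

sync                                         # flush dirty pages to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop the page cache, dentries and inodes
time tail -c 14G file | grep xxx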

Roger
  • Did you clear the disk cache between the benchmarks? I suspect most of the time in the first one was I/O. The speedup should be on the order of 15×, not 300×. – Reid Feb 02 '17 at 00:05
  • 2
    @Reid i didn't. But i did run *each* command multiple times. Im pretty sure that *dd* or *tail* will boost the speed significantly over just *grep* (cache or not). – Roger Feb 02 '17 at 07:19

This doesn't answer the title question, but it will do what you want. Use tac to reverse the file, then use grep to find your string. If your string occurs only once, or a known number of times, in the file, then let it run until it finds the known number of occurrences. That way, if your assumption about where it is in the file is incorrect, it will still find it. If you do want to limit the search, you can use head to do that; the head command would go between the tac and the grep.

So the command looks like:

tac < logfile | grep myString
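
For example, to stop at the most recent occurrence, or to cap how far back the search goes (a sketch; -m is a GNU grep option, and the million-line limit is just an illustration):

tac < logfile | grep -m 1 myString                # stop after the first (i.e. most recent) match
tac < logfile | head -n 1000000 | grep myString   # look at only the last million lines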
Itsme2003