0

Usually when I search a file with grep, the search is done sequentially. Is it possible to perform a non-sequential or parallel search? For example, can I search between line l1 and line l2 without having to go through the first l1-1 lines?

b4hand
Zeus
  • Is bash your language of preference for this? – Patrick Roberts May 03 '15 at 01:02
  • Correct, I use only bash on the terminal and when scripting. – Zeus May 03 '15 at 01:06
  • How big is your file and are your lines the same size? If you have lines that are all the same size you can do fixed byte offsets which will be much faster. – b4hand May 03 '15 at 01:39
  • A file can be the size of a book, let's say up to 1000 pages or a bit more. – Zeus May 03 '15 at 01:41
  • That's tiny for a computer. You are very unlikely to see actual improved performance by parallelizing the task. – b4hand May 03 '15 at 01:43
  • Things are not so fast because of the time it takes to access the file from disk, and because of how much output gets printed. – Zeus May 03 '15 at 01:47
  • Parallelizing still isn't the answer. You're more likely to see improved performance by suppressing output or changing your approach entirely depending on context. – b4hand May 03 '15 at 01:55
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/76791/discussion-between-zeus-and-b4hand). – Zeus May 03 '15 at 01:57

3 Answers

1

You can use tail -n +N file | grep to begin a grep at a given line offset.

You can combine head with tail to search over just a fixed range.
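
For example, a minimal sketch (100 and 200 here stand in for your l1 and l2, and 'pattern' is a placeholder):

tail -n +100 file.txt | head -n 101 | grep 'pattern'

The head -n 101 prints 101 lines, i.e. lines 100 through 200 inclusive.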

However, this still must scan the file for end of line characters.

In general, sequential reads are the fastest reads for disks. Trying to do a parallel search will most likely cause random disk seeks and perform worse.

For what it is worth, a typical book contains about 200 words per page. At a typical 5 letters per word, that's roughly 1 KB per page, so 1000 pages is still only about 1 MB. A standard desktop hard drive can easily read that in a fraction of a second.

You can't speed up disk read throughput this way. In fact, I can almost guarantee you are not saturating your disk read rate right now for a file that small. You can use iostat to confirm.
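
For instance, iostat ships with the sysstat package on most Linux distributions; running this while your grep is going prints extended device statistics every second, and the %util column shows how busy each disk actually is:

iostat -x 1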

If your file is completely ASCII, you may be able to speed things up by setting your locale to the C locale to avoid doing any type of Unicode translation.
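
For example (the pattern and file name are placeholders):

LC_ALL=C grep 'pattern' myfile.txt

Setting LC_ALL=C overrides all other locale settings for just that one command.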

If you need to do multiple searches over the same file, it would be worthwhile to build a reverse index to do the search. For code there are tools like exuberant ctags that can do that for you. Otherwise, you're probably looking at building a custom tool. There are tools for doing general text search over large corpuses, but that's probably overkill for you. You could even load the file into a database like Postgresql that supports full text search and have it build an index for you.
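
As a rough sketch of that last route (the table and column names are made up for illustration, and it assumes psql can already connect to a database you can create tables in; \copy in text format will also trip over backslashes or tabs in the file, so treat this only as a starting point):

psql -c "CREATE TABLE lines (n serial PRIMARY KEY, body text);"
psql -c "\copy lines (body) FROM 'myfile.txt'"
psql -c "CREATE INDEX lines_fts ON lines USING gin (to_tsvector('english', body));"
psql -c "SELECT n, body FROM lines WHERE to_tsvector('english', body) @@ to_tsquery('english', 'searchterm');"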

Padding the lines to a fixed record length is not necessarily going to solve your problem. As I mentioned before, I don't think you have an IO throughput issue; you could see that for yourself by simply moving the file to a temporary ram disk that you create, which removes all potential IO. If that's still not fast enough for you, then you're going to have to pursue an entirely different solution.
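
If you want to try that ram-disk test, a quick sketch (on most Linux systems /dev/shm is already a tmpfs, i.e. RAM-backed, mount):

cp myfile.txt /dev/shm/
grep 'pattern' /dev/shm/myfile.txt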

b4hand
1

If your lines are fixed length, you can use dd to read a particular section of the file:

dd if=myfile.txt bs=<line_length> count=<lines_to_read> skip=<start_line> | other_commands

Note that dd will read from disk using the block size specified for input (bs). Reading one short line at a time can be slow, so you may want to batch the reads by pulling at least 4 KB from disk at once. In that case, look at the skip_bytes and count_bytes flags so you can start and stop at lines that are not a multiple of your block size. Another interesting option is the output block size obs, which could benefit from being either the same as the input block size or a single line.
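
As a concrete sketch with GNU dd (assuming 90-character lines plus a newline, i.e. 91 bytes per line, and that you want lines 1001 through 2000):

dd if=myfile.txt bs=4k iflag=skip_bytes,count_bytes skip=$((1000 * 91)) count=$((1000 * 91)) status=none | grep 'pattern'

The skip_bytes and count_bytes flags make skip and count be interpreted in bytes rather than blocks, so you can keep a 4 KB block size while still starting and stopping exactly on line boundaries.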

Andrea Ratto
0

The simple answer is: you can't. What you want contradicts itself: You don't want to scan the entire file, but you want to know where each line ends. You can't know where each line ends without actually scanning the file. QED ;)

Gerard van Helden
  • Fair enough. Suppose I make the file have 90-character lines by appending whatever padding characters are needed. What commands would I need to scan between lines `l1` and `l2`? – Zeus May 03 '15 at 02:27
  • In bash, you'd use `head` and `tail` as suggested before :) But they do the read anyway, so it wouldn't be as much of a gain as you might think. If you want this to be super efficient, you'd probably be better off writing a simple C program for it yourself. Basic file I/O isn't that hard to do, especially if it's only reading. Then you can seek (i.e. move the file pointer) within an open file without actually reading the data. – Gerard van Helden May 03 '15 at 02:35
  • Does that require fixed line lengths? – Zeus May 03 '15 at 02:40
  • It depends on how accurate you want it to be. Otherwise you'll always get back to the posed issue of scanning for newlines. – Gerard van Helden May 03 '15 at 02:50
  • Accurate?? What's that about? – Zeus May 03 '15 at 02:56