Single-Pass File Scanning

Question

In my file-scanning D program I'm implementing logic for finding all hits of a set of key strings, together with line and column context, similar to grep.

My current algorithm works by calling find until the end of the file. When a hit is found, I search backwards and forwards to detect the byte offsets of the beginning and end of the hit line. Then I search backwards again to count the newlines between the beginning of the file and the hit's start offset. This is of course neither an efficient nor an elegant solution, but it currently works and has helped me understand how to operate on slices.

I now want to refactor this code to make use of some combination of state machines (monads) that only needs to go through the file once and that updates and operates on an array of the line starts found so far (size_t[]). Which std.algorithm functions should I base such a solution upon? The algorithm should output an array of tuples, where each tuple contains a hit slice, a bol/eol slice and a line number.

score 2 · Accepted Answer · edited Sep 21 '13 at 17:18

It is much simpler to iterate over all lines and keep track of the current line number:

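import std.range : enumerate;
import std.stdio : writeln;
import std.string : indexOf;

// Assumes `file` is an open std.stdio.File and `needle` is a string.
foreach (n, line; file.byLine.enumerate)
{
    auto index = indexOf(line, needle); // -1 if the needle is not in this line
    if (index >= 0)
    {
        writeln(n, ", ", index); // 0-based line number, offset within the line
    }
}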

edited Sep 21 '13 at 17:18 · Nordlöw

answered Sep 20 '13 at 10:25 · ratchet freak

Yes, but I want to support searching for keys that span multiple lines, and I want this to happen in the same pass as the line counting. I could of course do a substring search in each line and then implement some special behaviour that continues the search on the next line, but I'm not sure that is the most general solution. What about repeated calls to the variadic `find`, with the first needle being a newline character (or a set of needles if we want all newline encodings) and the rest being the search strings? – Nordlöw Sep 20 '13 at 11:18
@Nordlöw You can split the needle into lines. When there is only one line, use the version in the answer. When there are multiple lines, you can do an `endsWith` for the first line of the needle, then an `equal` for the middle lines, and finally a `startsWith` for the last line of the needle. – ratchet freak Sep 20 '13 at 11:20
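A minimal sketch of the line-splitting scheme from the comment above, assuming the haystack has already been split into lines without terminators. The name `findAcrossLines` is hypothetical (it is not a Phobos function), and plain `==` on slices is used here for the middle-line comparison.

```d
import std.algorithm.searching : endsWith, startsWith;
import std.array : split;
import std.string : indexOf;

// Hypothetical helper, not part of Phobos. `lines` must not contain line
// terminators. Returns the 0-based index of the line where the first match
// begins, or -1 if the needle does not occur.
ptrdiff_t findAcrossLines(string[] lines, string needle)
{
    auto parts = needle.split("\n"); // "a\nb" becomes ["a", "b"]
    if (parts.length == 0 || lines.length < parts.length)
        return -1;

    foreach (i; 0 .. lines.length - parts.length + 1)
    {
        if (parts.length == 1)
        {
            // Single-line needle: plain substring search, as in the answer.
            if (lines[i].indexOf(needle) >= 0)
                return cast(ptrdiff_t) i;
            continue;
        }
        immutable last = i + parts.length - 1;
        if (lines[i].endsWith(parts[0])               // first needle line ends line i
            && lines[i + 1 .. last] == parts[1 .. $ - 1] // middle lines match exactly
            && lines[last].startsWith(parts[$ - 1]))  // last needle line starts the final line
            return cast(ptrdiff_t) i;
    }
    return -1;
}

void main()
{
    import std.stdio : writeln;

    auto lines = ["int x;", "return", "x;", "end"];
    writeln(findAcrossLines(lines, "return\nx")); // prints 1
}
```

This only reports the line where a match begins; column offsets and repeated matches would need extra bookkeeping.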
BTW: When should I use the needle-variadic version of `find` instead of a `(ct)Regex`? Performance reasons? – Nordlöw Sep 20 '13 at 11:26
Note that `indexOf` is deprecated in favour of `countUntil`. I updated the code. – Nordlöw Sep 20 '13 at 11:44
The input file in my case is a memory file. Should I use an algorithm from `std.algorithm` or one from `std.string` to split the lines? I'm not currently assuming Unicode-correctness of the file so then I guess I should use `splitter` from `std.algorithm` right? – Nordlöw Sep 21 '13 at 17:57
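For comparison, here is a rough sketch of the single pass the question asks for, operating directly on an in-memory buffer instead of on pre-split lines. The names `Hit` and `scan` are hypothetical, only `\n` is treated as a line break, and the per-position comparison is a naive stand-in for a real multi-pattern search, so treat it as an illustration rather than a tuned implementation. Keys that contain newlines need no special handling, because the buffer is never split.

```d
import std.typecons : Tuple;

// Hypothetical result type: the matched slice, the slice of the line the
// match starts on, and a 1-based line number.
alias Hit = Tuple!(string, "hit", string, "line", size_t, "lineNo");

// Hypothetical helper, not part of Phobos. Walks `text` once and keeps the
// byte offsets of all line starts seen so far.
Hit[] scan(string text, string[] keys)
{
    size_t[] lineStarts = [0];
    Hit[] hits;

    foreach (i; 0 .. text.length)
    {
        foreach (key; keys)
        {
            if (key.length && i + key.length <= text.length
                && text[i .. i + key.length] == key)
            {
                // The line slice is filled in below, once its end is known.
                hits ~= Hit(text[i .. i + key.length], "", lineStarts.length);
            }
        }
        if (text[i] == '\n')
            lineStarts ~= i + 1;
    }

    // Resolve each hit's line slice from the recorded line starts.
    foreach (ref h; hits)
    {
        immutable bol = lineStarts[h.lineNo - 1];
        immutable eol = h.lineNo < lineStarts.length
            ? lineStarts[h.lineNo] - 1 // index of the terminating '\n'
            : text.length;             // last line has no terminator
        h.line = text[bol .. eol];
    }
    return hits;
}

void main()
{
    import std.stdio : writeln;

    foreach (h; scan("foo\nbar foo\nbaz", ["foo"]))
        writeln(h.lineNo, ": ", h.line);
}
```

The end of a hit's line is not known when the hit is found, which is why the line slices are resolved from `lineStarts` after the scan; that second loop touches only the hits, not the file contents.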
