I have lines in a file that look like this:

2023-06-26 00:00:00|Execution error for request 52262275. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 52262275. Job error is: one or more subbatches has failed to generate feed.|111975431

And I have a pattern file that is almost 200 lines long, with lines that look like this:

s~(.*Execution error for request [0-9]+. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request [0-9]+. Job error is: one or more subbatches has failed to generate feed..*)~\1|959361986~p
t
s~(.*Execution error for request [0-9]+. Reason:.*)~\1|1735893446~p

The idea is that this list of patterns matches more and more broadly as it goes, with the broadest patterns at the bottom.

What I am trying to do is match a given pattern to a line in a file, and append the ID in the sed expression to the end of the line. So:

this line in the data file:

2023-06-26 00:00:00|Execution error for request 52262275. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 52262275. Job error is: one or more subbatches has failed to generate feed.|111975431

Should match this expression:

s~(.*Execution error for request [0-9]+. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request [0-9]+. Job error is: one or more subbatches has failed to generate feed..*)~\1|959361986~p

And ultimately output: 111975431|959361986

I am currently doing this as such:

sed -E -n -f ./workdir/2023-08-11/2023-06-26-00-00-00_2023-07-20-23-59-59/templates.sed \
    ./workdir/2023-08-11/2023-06-26-00-00-00_2023-07-20-23-59-59/data.psv \
  | cut -d'|' -f3- | sort -t '|' -k1 -n

The idea being to match the message with the regex, extract the number on the end, and pair it with the number in the sed expression.

This process works but takes forever! The data.psv file I am currently working on is 3.2 GB and took 30 minutes to process. While this is not typical, I also can't rule it out. In addition, I expect the number of patterns to increase over time.

So, how can I optimize this process? I can change the order of the columns in the input file (I can remove the date field at this stage, if it helps, and say match ^foo\|[0-9]+$). I could also split the file, fork a bunch of processes with xargs -P, and then rejoin the files (roughly as sketched below).
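
For illustration, the split-and-fork approach I have in mind would look roughly like this (untested; the 500M chunk size, the chunk./result.psv names and the use of nproc are just placeholders, and the templates.sed/data.psv paths are shortened):

# rough sketch: split data.psv into ~500MB line-aligned chunks, run one sed per
# chunk in parallel, then merge the per-chunk results and sort as before
split -C 500M data.psv chunk.
printf '%s\n' chunk.* |
  xargs -P "$(nproc)" -I{} sh -c 'sed -E -n -f templates.sed {} | cut -d"|" -f3- > {}.out'
cat chunk.*.out | sort -t '|' -k1 -n > result.psv
rm chunk.*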

I am open to suggestions.

Christian Bongiorno
  • what percentage of lines in `data.psv` will match on one of the lines from `templates.sed`? if the percentage is (relatively) low, is there a common pattern(s) that could be used to filter out the unwanted lines from `data.psv`? the idea being to preprocess `data.psv` to reduce the number of lines that have to be processed by `sed` ... or has `data.psv` already been reduced as far as possible (ie, all lines in `data.psv` *will* match a line from `templates.sed`)? – markp-fuso Aug 11 '23 at 18:39
  • All lines can/will need to match something in the sed file. So, 100% – Christian Bongiorno Aug 11 '23 at 19:40
  • since the operation is going to be (mostly) cpu bound the easy/simple solution would be to split `data.psv` into a handful of smaller files and run parallel `sed` processes; another possibility, heavily dependent on detailed knowledge of the data, would be breaking the search criteria into two chunks: *1)* looking for specific strings (eg, `ESS-XXXXX`) that could be used against a lookup/hash table (faster) vs *2)* (the current method) looking for generalized patterns (slower) – markp-fuso Aug 11 '23 at 19:46
  • Can you explain the lookup/hash idea? – Christian Bongiorno Aug 11 '23 at 19:55
  • FWIW hash lookup was my original intent too before I noticed you can't just map ALL digits to something else before doing the lookup or even a string comparison and it's not clear if you could have other regexps in your search strings too. – Ed Morton Aug 11 '23 at 19:57
  • out of curiosity does performance change if you rewrite `s/(.*re.*)/\1|num/p` as `/re/s/$/|num/p` ? – jhnc Aug 11 '23 at 20:47
  • re: lookup/hash ... assume `ESS-07033` only shows up in the 1st `sed` string, we create `lookup[ESS-07033]=111975431`; while processing a line from `data.psv`, we could run a (relatively) quick match for a pattern like `ESS-[0-9]{5}` and if found then verify the match is an index in the `lookup[]` array and if true then the array match contains our new value; if you have a lot of `ESS-` then this would be a fast shortcut for processing the associated rows; gets tricky though if lots of different patterns and/or most rows don't have a unique string to match on – markp-fuso Aug 11 '23 at 20:54
  • My hash lookup idea was to on-the-fly while reading the data change all the digits in the strings before the last field from both files into newlines or similar (because that can't be present in the original strings), then store the search strings from the "patterns" file as indices of an array (a hash), then use the strings from the "data" file to look up its string in the array for a match. But you can't simply do that if some digits in those strings matter, e.g. the `07033` in `ESS-07033`. – Ed Morton Aug 12 '23 at 12:05
  • If we knew more about the "patterns" you need to find we could help more. For example if the job number is **always** `ESS-07033` and you don't have any unique data values other than the numbers you currently match with `[0-9]+` then we're back to being able to at least do string instead of regexp comparisons which should produce a significant speedup. If you always want to match series of words from the start of the line to some minimum number of words then we can dynamically truncate the "pattern" and the target strings to do a hash lookup. But there's not enough information/examples to tell – Ed Morton Aug 12 '23 at 12:36

2 Answers


Try using awk instead, e.g. something like (untested):

awk '
    BEGIN { FS=OFS="|" }
    NR==FNR {                           # first file: "regex|number" pairs, in order
        targets[++numTargets] = $1
        numbers[numTargets] = $2
        next
    }
    {                                   # second file: try each regex in turn against the message field
        for ( i=1; i<=numTargets; i++ ) {
            if ( $2 ~ targets[i] ) {
                print $3, numbers[i]    # e.g. 111975431|959361986
                next
            }
        }
    }
' file1 file2

where file1 contains the strings to match on like this:

Execution error for request [0-9]+. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request [0-9]+. Job error is: one or more subbatches has failed to generate feed.|959361986
Execution error for request [0-9]+. Reason:|1735893446

and file2 is the file to search in:

2023-06-26 00:00:00|Execution error for request 52262275. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 52262275. Job error is: one or more subbatches has failed to generate feed.|111975431
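
If it helps, file1 probably doesn't have to be written by hand; it could be derived from your existing templates.sed. A rough, untested sketch that assumes every pattern line has exactly the s~(.*REGEX.*)~\1|ID~p shape shown in the question (the t lines simply won't match and are dropped):

# extract "REGEX|ID" from each s~(.*REGEX.*)~\1|ID~p line of the sed script;
# bracket expressions make the literal ( . * ) | characters unambiguous in ERE
sed -E -n 's/^s~[(][.][*](.*)[.][*][)]~\\1[|]([0-9]+)~p$/\1|\2/p' templates.sed > file1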

We talked a bit about hash lookups in the comments under the question; here's an example of what could be done, depending on various criteria about your domain that we just don't know yet.

map_file (decreasing length substrings, segmented at :s):

Execution error for request 0. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 0. Job error is: one or more subbatches has failed to generate feed.|959361986
Execution error for request 0. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 0. Job error is|1234567890
Execution error for request 0. Reason|1735893446

data_file (same as today):

2023-06-26 00:00:00|Execution error for request 52262275. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 52262275. Job error is: one or more subbatches has failed to generate feed.|111975431

with this script:

awk '
    BEGIN { FS=OFS="|" }
    NR == FNR {                         # map_file: turn digit runs into newlines, index by the result
        gsub(/[0-9]+/,RS,$1)
        map[$1] = $2
        next
    }
    {                                   # data_file: normalize the digits the same way, then look up
        gsub(/[0-9]+/,RS,$2)
        str = $2
        while ( ! (str in map) ) {
            if ( ! sub(/:[^:]*$/,"",str) ) {    # drop everything from the final ":" and retry
                break
            }
        }
    }
    str in map {
        print $3, map[str]
    }
' map_file data_file

it outputs:

111975431|959361986

since the longest "pattern" matches the data.

So we convert every run of digits in the "pattern" and the data to a newline (a value that can't be present in the original strings) so that their exact values are ignored during the comparison. While reading map_file we just populate a hash table; then, when reading data_file, we first do a hash lookup to see if the full $2 from data_file is in the map created from map_file. If it is, we print the associated numbers; if it isn't, we remove everything from the final : on and try again, and so on.

But maybe we can't ignore all digits, maybe there are other things whose values we could ignore, and maybe iteratively truncating on :s isn't the right granularity for comparisons - we just don't know enough about what you're trying to do yet to be able to tell what kind of solution you need.

Ed Morton
  • @jhnc I don't know exactly what that sed expression does (Is it stopping comparing regexps after the first match?) and I don't know how efficient sed is in general for this but the awk script is only matching on the 2nd field rather than the whole line, not matching on `.*`s at each end of each expression, and not using capture groups so I expect it'll be faster. – Ed Morton Aug 12 '23 at 11:57
  • yes, `t` is a bit like `next` (branch if substitution changed anything). awk and sed presumably both have to scan to find the newline. I don't know if awk's autosplit is part of the scan or an extra pass. Could be a win. I just noticed you aren't using the regex from the question. That will definitely be faster. – jhnc Aug 12 '23 at 12:08
  • I suspected `t` was like next but wasn't sure, thanks for confirming. In general if I can't do something in sed with s, g, and p then I move on to awk. I suspect awk is just faster at matching regexps in general since sed supports capture groups in the regexp while awk doesn't BUT awk does also have that field splitting step which sed doesn't so it'll be interesting to see what the OP reports from using both. – Ed Morton Aug 12 '23 at 12:14
  • I'm fairly certain leaving out the leading `.*` will dominate – jhnc Aug 12 '23 at 12:20
  • We could get the OP to add those back in after they've tried this to see just how much impact they have. – Ed Morton Aug 12 '23 at 12:21

The leading greedy .*s have a large detrimental effect on your runtime due to unnecessary backtracking, both when there is a match and when there isn't. Simply eliminating them will give you a significant speedup.

After removing the leading .*, https://regex101.com says your regex takes 218 steps to match instead of 458. Changing your data so it won't match about halfway along (eg. replacing occurred with x) takes 100 steps to fail instead of 353. Earlier failures will be even cheaper.
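
For example (untested), the broadest rule from the question could be rewritten in the address-then-append form mentioned in the comments, dropping both the .* and the capture group:

# original rule:
#   s~(.*Execution error for request [0-9]+. Reason:.*)~\1|1735893446~p
#   t
# same effect, using the regex as an address and appending to the end of the line:
/Execution error for request [0-9]+. Reason:/ s~$~|1735893446~p
t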


I believe Perl makes some guarantees about which alternation is selected.

If my understanding is correct, you could combine all the regex into one and match in parallel. This might give a useful speedup.

perl -F'[|]' -lE '
    say $F[2],"|",keys%+ if $F[1] =~ /
        (?<name1>re1)|
        (?<name2>re2)|
        (?<name3>re3)|
        ...
    /x
' data.psv | sort -n
  • -F splits line into array @F
  • the hash %+ gives access to the named capture buffers
  • each nameN will be the code to append (surprisingly, I don't think they have to be distinct)

A caveat is that the leftmost matching alternation is only selected if all the regexes that match start at the same character, so you should probably anchor your regexes, for example by prefixing each one with ^.*? (non-greedy, so the effect on matches is small, but it will still be detrimental when there is no match).

jhnc
  • after some experiments on regex101, it's not clear using alternations like this helps. regex101 thinks PCRE2 will try each alternation in series – jhnc Aug 12 '23 at 06:36
  • but looks like some other regex engine implementation **do** offer parallel processing. eg. https://github.com/openresty/sregex#sre_regex_parse_multi – jhnc Aug 12 '23 at 06:45
  • Do you know if Perl's regexp engine is still slower than awk's due to having to handle look-arounds? There are some articles about that at https://swtch.com/~rsc/regexp/regexp1.html and https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-and-most-elegant-big-data-munging-language/ (from before gawk was sped up during a restructure to be about as fast as mawk), and an anecdotal article at https://www.libertysys.com.au/2011/03/an-interesting-performance-difference-between-perl-and-awk/ about it but they're all over 10 years old. – Ed Morton Aug 12 '23 at 12:28
  • @EdMorton sorry, no idea. Suspect RE2 will beat both :-) – jhnc Aug 12 '23 at 12:40
  • I hadn't heard of RE2 before, interesting, thanks! From https://github.com/google/re2/wiki/WhyRE2 it looks like RE2 was created for safety considerations rather than execution speed ("RE2 was designed and implemented with an explicit goal of being able to handle regular expressions from untrusted users without risk.... It is not a goal to be faster than all other engines under all circumstances."), but it doesn't support backreferences and lookarounds or some other PCRE constructs so I expect it will be faster than PCREs, not sure it'll be faster than EREs. – Ed Morton Aug 12 '23 at 12:43
  • I think it's developed from Russ Cox's work that you linked. While not perhaps designed with performance as a priority, never having to worry about unexpected pathological slowdowns sounds like a win. – jhnc Aug 12 '23 at 12:49
  • Agreed, but that's a PCRE performance concern, not an ERE concern. My take from the little reading I just did is that RE2 will be similar to EREs in terms of syntax and performance characteristics but with extra security considerations. – Ed Morton Aug 12 '23 at 12:50