I have lines in a file that look like this:
2023-06-26 00:00:00|Execution error for request 52262275. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 52262275. Job error is: one or more subbatches has failed to generate feed.|111975431
And I have a pattern file that is almost 200 lines long, with lines that look like this:
s~(.*Execution error for request [0-9]+. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request [0-9]+. Job error is: one or more subbatches has failed to generate feed..*)~\1|959361986~p
t
s~(.*Execution error for request [0-9]+. Reason:.*)~\1|1735893446~p
The idea is that this list of patterns gets broader and broader as it goes, with the broadest at the bottom. (The t after each substitution branches past the remaining rules once one matches, so only the first, most specific pattern applies.)
What I am trying to do is match each line in the data file against a pattern and append the ID embedded in the matching sed expression to the end of the line. So:
this line in the data file:
2023-06-26 00:00:00|Execution error for request 52262275. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 52262275. Job error is: one or more subbatches has failed to generate feed.|111975431
Should match this expression:
s~(.*Execution error for request [0-9]+. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request [0-9]+. Job error is: one or more subbatches has failed to generate feed..*)~\1|959361986~p
And ultimately output: 111975431|959361986
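For reference, here is a minimal, self-contained reproduction of that step. The two-rule pattern file below is a trimmed stand-in for my real ~200-line file, and the file names are just placeholders:

```shell
# Build a one-line sample of the data file and a two-rule pattern file,
# then run the same sed | cut pipeline over them.
cat > data.psv <<'EOF'
2023-06-26 00:00:00|Execution error for request 52262275. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request 52262275. Job error is: one or more subbatches has failed to generate feed.|111975431
EOF

cat > templates.sed <<'EOF'
s~(.*Execution error for request [0-9]+. Reason: ESS-07033 Job logic indicated a system error occurred while executing an asynchronous java job for request [0-9]+. Job error is: one or more subbatches has failed to generate feed..*)~\1|959361986~p
t
s~(.*Execution error for request [0-9]+. Reason:.*)~\1|1735893446~p
EOF

# The first (most specific) rule matches and appends |959361986;
# cut -f3- then keeps only the trailing ID pair.
sed -E -n -f templates.sed data.psv | cut -d'|' -f3-
```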
I am currently doing this as such:
sed -E -n -f ./workdir/2023-08-11/2023-06-26-00-00-00_2023-07-20-23-59-59/templates.sed \
    ./workdir/2023-08-11/2023-06-26-00-00-00_2023-07-20-23-59-59/data.psv \
  | cut -d'|' -f3- | sort -t '|' -k1 -n
The idea being: match the message with a regex, extract the ID at the end of the line, and pair it with the ID from the sed expression.
This process works, but it takes forever! The data.psv file I am currently working on is 3.2 GB and took 30 minutes to process. While a file that size is not typical, I can't rule it out, and I expect the number of patterns to grow over time.
So, how can I optimize this process? I can change the order of the columns in the input file (I could drop the date field at this stage, if that helps, and match something like ^foo\|[0-9]+$). I could also split the file and fork a bunch of processes with xargs -P, then rejoin the outputs.
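A rough sketch of what I have in mind for the split-and-fork variant. The chunk size and the nproc call are placeholders, and the sample data below just stands in for my real data.psv and templates.sed:

```shell
# Hypothetical parallel variant: split the input into chunks, run the
# sed + cut stage on each chunk concurrently, then merge and sort.
# Tiny sample inputs stand in for the real 3.2 GB file.
printf '%s\n' \
  'a|Execution error for request 1. Reason: x|111' \
  'b|Execution error for request 2. Reason: y|222' > data.psv
printf '%s\n' \
  's~(.*Execution error for request [0-9]+\. Reason:.*)~\1|1735893446~p' > templates.sed

split -l 1 data.psv chunk.        # real chunks would be ~1M lines each
ls chunk.?? | xargs -P "$(nproc)" -I{} sh -c \
  'sed -E -n -f templates.sed "$1" | cut -d"|" -f3- > "$1.out"' sh {}
cat chunk.??.out | sort -t '|' -k1 -n > result.psv
```

Passing the chunk name as a positional parameter to sh -c (rather than splicing {} into the command string) avoids quoting problems with unusual file names.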
I am open to suggestions.