6

I have a 84 million-line XML that I am processing with 'gawk' in Red Hat Linux. (OK, some people would recommend to use other tools rather than GAWK, but my XML doesn't have multiline tags or any other peculiarities that make GAWK not a good choice for the job.)

My concern is about performance.

My initial AWK script is something like this:

# Test_1.awk
BEGIN {FS = "<|:|=";}
{
if ($3 == "SubNetwork id")
    {
    # do something
    }
}
END {
# print something
}

That makes 84 million string comparisons, once every line.

I noticed that "SubNetwork id" only appears when there are 4 fields in the line (NF=4), so I changed the script to make fewer string comparisons:

# Test_2.awk
BEGIN {FS = "<|:|=";}
{
if (NF == 4)
    {
    if ($3 == "SubNetwork id")
        {
        # do something
        }
    }
}
END {
# print something
}

I run it and saw that I was checking 'NF == 4' 84 million times (obvious) and '$3 == "SubNetwork id"' only 3 million times. Great, I had reduced the number of string comparisons, which I've always thought are more time-consuming than simple integer comparisons (NF is an integer, right?).

My surprise came when I tested both scripts for performance. Most of the times Test_1 was faster than Test_2. I run them many times to account for other processes that might be using CPU time, but overall my tests were run when the CPU was more or less 'idle'.

My brain tells me that 84 million integer comparisons plus 3 million string comparisons must be faster than 84 million string comparisons, but obviously something is wrong with my reasoning.

My XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<ConfigDataFile xmlns:un="specific.xsd" xmlns:xn="generic.xsd">
    <configData dnPrefix="Undefined">
        <xn:SubNetwork id="ROOT_1">
            <xn:SubNetwork id="ROOT_2">
                <xn:attributes>
                ...
                </xn:attributes>
            </xn:SubNetwork>
            <xn:SubNetwork id="ID_1">
            ....
            </xn:SubNetwork>
            <xn:SubNetwork id="ID_2">
            .....
            </xn:SubNetwork>
        </xn:SubNetwork>
    </configData>
</ConfigDataFile>

Any help to understand this performance problem would be appreciated.

Thanks in advance.

Mike Duke
  • 101
  • 1
  • 6

3 Answers3

4

I did more tests:

1- Generate a large file with some data

yes 'SomeSampleText SomeOtherText 33 1970 YetAnotherText 777 abc 1 AndSomeMore' | head -12000000 > SomeData.txt

The delimiter is whitespace!

2- Run these 6 tests, several times, and compute average time for each test. I did it on 3 diferent machines (with Red Hat Linux Enterprise 4)

time gawk 'BEGIN {a = 0;} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if (NF == 9) a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if ($1 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData.txt
time gawk 'BEGIN {a = 0;} {if ($9 == "AndSomeMore") a ++;} END {print "a: " a;}' SomeData.txt

3- I got these results (numbers are seconds)

-- Machine 1
10.35
39.39
38.87
10.40
7.72
12.26

-- Machine 2
8.50
32.43
31.83
9.10
6.54
9.91

-- Machine 3
12.35
13.55
12.90
14.40
9.43
14.93

It looks like searching for pattern /YetAnotherText/ in tests 2 and 3 was very slow. Except for Machine 3...

4- Generate another large file with some data with different delimiters

yes "<SomeSampleText:SomeOtherText=33>1970<YetAnotherText:777=abc>1<AndSomeMore>" | head -12000000 > SomeData2.txt

5- Run 6 tests, changing the FS

time gawk 'BEGIN {FS = "<|:|=";} {if ($5 == "YetAnotherText") a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if ($0 ~ /YetAnotherText/) a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} /YetAnotherText/ {a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if (NF == 8) a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if ($2 == "SomeSampleText") a ++;} END {print "a: " a;}' SomeData2.txt
time gawk 'BEGIN {FS = "<|:|=";} {if ($8 == "AndSomeMore>") a ++;} END {print "a: " a;}' SomeData2.txt

6- I got these results (I only did it for Machine 3, sorry)

66.17
33.11
32.16
76.77
37.17
77.20

My conclusions (also see comments by @user31264):

  • It seems that parsing and splitting into fields is faster when there is one simple delimiter, instead of several delimiters.
  • Usually getting $N is faster than getting $M, where N < M
  • In some cases, searching for /pattern/ in the whole line is faster than comparing $N == "pattern", especially if N is not one of the first fields of the line
  • Getting NF can be slow because the line has to be parsed and fields calculated, and more so if there are several delimiters
Mike Duke
  • 101
  • 1
  • 6
1

Below is a simple test. The first line outputs 10,000,000 lines "a b c d" into the file a. awk is GNU Awk 4.1.3

[~] yes 'a b c d' | h -10000000 > a
[~] time awk '{if(NF==5)print("a")}' a
2.344u 0.012s 0:02.36 99.5%     0+0k 0+0io 0pf+0w
[~] time awk '{if(NF==5)print("a")}' a
2.364u 0.008s 0:02.37 99.5%     0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.876u 0.024s 0:02.90 99.6%     0+0k 0+0io 0pf+0w
[~] time awk '{if($4=="Hahaha")print("a")}' a
2.880u 0.020s 0:02.90 100.0%    0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.540u 0.020s 0:02.56 100.0%    0+0k 0+0io 0pf+0w
[~] time awk '{if($1=="Hahaha")print("a")}' a
2.404u 0.004s 0:02.41 99.5%     0+0k 0+0io 0pf+0w

As you can see, checking $1 is faster than checking $4, because in the former case AWK needs to parse the line only up to the first word. If you check only NF, AWK only counts words, which in my case was even faster, but in your case it might be slower to count words than to parse the input line up to the 3rd word.

Finally, we can speed up AWK like this:

[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.376u 0.020s 0:01.40 99.2%     0+0k 0+0io 0pf+0w
[~] time awk '/Hahaha/{if($4=="Hahaha")print("a")}' a
1.372u 0.028s 0:01.40 99.2%     0+0k 0+0io 0pf+0w

because /Hahaha/ does not require any parsing.

If you add /SubNetwork id/ before the {, it may speed up things.

If you process only lines with "SuNetwork id" and ignore all the others, you may want to do

grep 'SubNetwork id' your_input_file | awk -f prog.awk

It would speed up things drastically, since grep is much faster than awk.

Finally, one more way to speed up awk is to use mawk, which is much faster than gawk. Unfortunately, sometimes it produces different results than gawk, so it should always be tested.

user31264
  • 6,557
  • 3
  • 26
  • 40
  • According to the manual, (https://www.gnu.org/software/gawk/manual/gawk.html#Fields), "When awk reads an input record, the record is automatically parsed or separated by the awk utility into chunks called fields." Isn't parsed like this [$1 -> (blank)], [$2 -> xn], [$3 -> SubNetwork id], [$4 -> "ID_1">], [NF is set to 4] before you start doing anything with the fields? From your answer I gather that $3 is only assigned/evaluated if $3 is referenced in the script, right? – Mike Duke Apr 20 '17 at 10:03
  • Another thing: /SubNetwork id/ doesn't require any parsing, but it is still a string comparison inside the whole line, which must surely be time-consuming. Is really the gawk parsing (splitting the line into fields) so bad for performance, worse even than a string comparison against the whole line? – Mike Duke Apr 20 '17 at 10:10
  • @MikeDuke - "When awk reads an input record, the record is automatically parsed or separated by the awk utility into chunks called fields." - this describes the semantics, not the internal behaviour. " Is really the gawk parsing (splitting the line into fields) so bad for performance, worse even than a string comparison against the whole line?" - did you see the test results? Yes, it is slow. By the way, pattern search is linear over the string length, and pretty fast (if you do it correctly). grep, in my test, took less than 0.1 second. – user31264 Apr 20 '17 at 11:07
  • Why shouldn't you test everything yourself? – user31264 Apr 20 '17 at 11:09
  • I was testing it on several machines and waiting to collect all the data. Please have a look at my reply, your answer was helpful. Thanks @user31264 – Mike Duke Apr 20 '17 at 15:29
  • @MikeDuke - one important thing you missed in tests is `/foo/{bar}` is faster than `{bar}` when `foo` occurs infrequently. – user31264 Apr 20 '17 at 16:25
  • (When foo is in every line, you don't get any improvement from /foo/{bar} .) – user31264 Apr 20 '17 at 16:49
  • Thanks @user31264, you are right. Fortunately, my real XML only has 28 occurrences of "SubNetwork id" in 84 million lines. – Mike Duke Apr 21 '17 at 06:45
0

another simple test

file is a 3.000.000 lines copy of your whole sample. Result is the a representative time after 3 run (for cache and other OS impact)

# time awk 'BEGIN{FS="[<:=]"}NF>=4{a++}END{print a+0}' YourFile
780100
real    0m1.89s user    0m1.74s sys     0m0.01s

# time awk 'BEGIN{FS="<|:|="}NF>=4{a++}END{print a+0}' YourFile
780100
real    0m2.00s user    0m1.91s sys     0m0.02s

# time awk 'BEGIN{FS="<|:|="}NF>=4&&/:SubNetwork/{a++}END{print a+0}' YourFile
780100
real    0m3.09s user    0m2.93s sys     0m0.02s

# time awk 'BEGIN{FS=":SubNetwork"}NF>=2{a++}END{print a+0}' YourFile
1560200
real    0m1.32s user    0m1.27s sys     0m0.02s

# time awk '/:SubNetwork/{a++}END{print a}' YourFile
1560200
real    0m3.23s user    0m3.06s sys     0m0.02s

show that if you use your :SubNetwork as field separator, it's the fastest.

Now for a following action you maybe need to split or reseparate the field via something like FS="<|:|=";$1=$1"";$0=$0""; ... your action ...; FS=":SubNetwork"

extra test as pre-filter

# time awk '$1 == "<xn:SubNetwork" || $1 == "<xn:Attributes" {a++}END{print a+0}' YourFile
780100
real    0m1.29s user    0m1.20s sys     0m0.03s
NeronLeVelu
  • 9,908
  • 1
  • 23
  • 43
  • I can't change the field separator because in my real script I need to find more strings, not only "SubNetwork id". I simplified my real script into two smaller scripts (the ones in my original post) in order to test performance, and it surprised me that comparing $3 == "SubNetwork id" is slower than comparing NF == 4 – Mike Duke Apr 20 '17 at 10:08
  • i use `:SubNetwork` as sample to the test that only catch line via NF. idea is to work like you did, 1) pre filter fast 2) detail filter heavy. so see if there is a common part in all your constaint that can *easily* separate the line. – NeronLeVelu Apr 20 '17 at 10:59
  • added an extra test that could help a bit with your last comment – NeronLeVelu Apr 20 '17 at 11:07
  • It's clear that choosing the appropiate FS can make a significant impact @NeronLeVelu – Mike Duke Apr 21 '17 at 06:50