I have an 84-million-line XML file that I am processing with 'gawk' on Red Hat Linux. (OK, some people would recommend using other tools rather than GAWK, but my XML doesn't have multiline tags or any other peculiarities that would make GAWK a poor choice for the job.)
My concern is about performance.
My initial AWK script is something like this:
# Test_1.awk
BEGIN {FS = "<|:|=";}
{
    if ($3 == "SubNetwork id")
    {
        # do something
    }
}
END {
    # print something
}
That makes 84 million string comparisons, one per line.
I noticed that "SubNetwork id" only appears when there are 4 fields in the line (NF=4), so I changed the script to make fewer string comparisons:
# Test_2.awk
BEGIN {FS = "<|:|=";}
{
    if (NF == 4)
    {
        if ($3 == "SubNetwork id")
        {
            # do something
        }
    }
}
END {
    # print something
}
I ran it and saw that I was checking 'NF == 4' 84 million times (obviously) and '$3 == "SubNetwork id"' only 3 million times. Great, I had reduced the number of string comparisons, which I've always assumed are more time-consuming than simple integer comparisons (NF is an integer, right?).
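For reference, those counts can be reproduced with an instrumented variant roughly like the sketch below (the file name and counter names are just illustrative):

# count_checks.awk -- rough sketch, only for counting how often each
# condition is evaluated; not part of the real processing
BEGIN {FS = "<|:|=";}
{
    nf_checks++                      # every input line reaches the NF test
    if (NF == 4)
    {
        str_checks++                 # only NF==4 lines reach the string test
        if ($3 == "SubNetwork id")
        {
            matches++
        }
    }
}
END {
    printf "NF checks: %d\nstring checks: %d\nmatches: %d\n", nf_checks, str_checks, matches
}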
My surprise came when I tested both scripts for performance. Most of the time Test_1 was faster than Test_2. I ran them many times to account for other processes that might be using CPU time, but overall my tests were run while the CPU was more or less idle.
My brain tells me that 84 million integer comparisons plus 3 million string comparisons must be faster than 84 million string comparisons, but obviously something is wrong with my reasoning.
My XML looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<ConfigDataFile xmlns:un="specific.xsd" xmlns:xn="generic.xsd">
    <configData dnPrefix="Undefined">
        <xn:SubNetwork id="ROOT_1">
            <xn:SubNetwork id="ROOT_2">
                <xn:attributes>
                ...
                </xn:attributes>
            </xn:SubNetwork>
            <xn:SubNetwork id="ID_1">
            ....
            </xn:SubNetwork>
            <xn:SubNetwork id="ID_2">
            .....
            </xn:SubNetwork>
        </xn:SubNetwork>
    </configData>
</ConfigDataFile>
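For completeness, here is a small stand-alone sketch (the file name and the hard-coded sample line are just for illustration) of how FS = "<|:|=" splits one of the SubNetwork lines:

# fields_demo.awk -- sketch showing how the field separator splits a
# sample SubNetwork line
BEGIN {
    line = "<xn:SubNetwork id=\"ROOT_1\">"
    n = split(line, f, /<|:|=/)
    printf "NF would be %d\n", n
    for (i = 1; i <= n; i++)
        printf "field %d = [%s]\n", i, f[i]
}

On that line, split() produces 4 fields, with the third equal to "SubNetwork id", which is what both Test_1.awk and Test_2.awk compare against.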
Any help in understanding this performance behaviour would be appreciated.
Thanks in advance.