
I have a file in which I have to merge pairs of rows based on:
- a common session ID
- the immediately following matching pattern (a GX row with the next QG row)

file1:

session=001,field01,name=GX1_TRANSACTION,field03,field04    
session=001,field91,name=QG    
session=001,field01,name=GX2_TRANSACTION,field03,field04    
session=001,field92,name=QG    

session=004,field01,name=GX1_TRANSACTION,field03,field04    
session=002,field01,name=GX1_TRANSACTION,field03,field04    
session=002,field01,name=GX2_TRANSACTION,field03,field04    
session=002,field92,name=QG    

session=003,field91,name=QG    
session=003,field01,name=GX2_TRANSACTION,field03,field04    
session=003,field92,name=QG    

session=004,field91,name=QG    
session=004,field01,name=GX2_TRANSACTION,field03,field04    
session=004,field92,name=QG    

I have written an awk script (I am new to awk and learned it from this site only) that produces my desired output.

Output1:

session=001,field01,name=GX1_TRANSACTION,field03,field04,session=001,field91,name=QG
session=001,field01,name=GX2_TRANSACTION,field03,field04,session=001,field92,name=QG
session=002,field01,name=GX1_TRANSACTION,field03,field04,NOMATCH-QG
session=002,field01,name=GX2_TRANSACTION,field03,field04,session=002,field92,name=QG
session=003,field01,name=GX2_TRANSACTION,field03,field04,session=003,field92,name=QG
session=004,field01,name=GX1_TRANSACTION,field03,field04,session=004,field91,name=QG
session=004,field01,name=GX2_TRANSACTION,field03,field04,session=004,field92,name=QG

Output2 (the Pending file):

session=003,field91,name=QG    

Awk:

{
    # GX line: if an unmatched GX is already stored for this session,
    # emit it with NOMATCH-QG, then remember the current line
    if($0~/name=GX1_TRANSACTION/ || $0~/GX2_TRANSACTION/) {
        if($1 in ccr)
            print ccr[$1]",NOMATCH-QG";
        ccr[$1]=$0;
    }
    # QG line: pair it with the stored GX line for the same session,
    # otherwise write it to the Pending file
    if($0~/name=QG/) {
        if($1 in ccr) {
            print ccr[$1]","$0;
            delete ccr[$1];
        }
        else {
            print $0",NOUSER" >> Pending
        }
    }
}
# flush any GX lines that never found a QG partner
END {
    for (i in ccr)
        print ccr[i]",NOMATCH-QG"
}

Command:

awk -F"," -v Pending=t -f a.awk file1    

The issue is that my "file1" is really big, so I want to improve the performance of this script. Is there any way I can make it faster?

– Vipin Choudhary
  • I cannot see any obvious places for improvement. If you have `gawk` version 4, you can try running it with `gawk --profile`; a file `awkprof.out` is generated with profiling information. You could also try porting the program to `perl` to see if `perl` gives a faster solution. – Håkon Hægland Dec 29 '13 at 06:58
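For reference, the profiling run suggested in that comment might look like this (a sketch, assuming gawk 4; the options are the same as in the original command, and the profile is written to awkprof.out in the current directory):

gawk --profile -F"," -v Pending=t -f a.awk file1
# awkprof.out now shows each rule annotated with its execution count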

2 Answers


There are a couple of changes that may lead to small improvements in speed and, if not, may at least give you some ideas for future awk scripts.

  1. Don't "manually" test every line if you don't have to - move the name= tests out of the action block and into awk's pattern rules. Currently your script checks $0 up to three times per line for a name= match.
  2. Since you're using , as the FS, test the corresponding field ($3) instead of $0. It only saves a few leading chars of pattern matching in your example data.

Here's a refactored a.awk:

$3~/name=GX[12]_TRANSACTION/ {
    if($1 in ccr)
        print ccr[$1]",NOMATCH-QG";
    ccr[$1]=$0;
}

$3~/name=QG/ {
    if($1 in ccr) {
        print ccr[$1]","$0;
        delete ccr[$1];
    }
    else {
        print $0",NOUSER"  >> Pending
    }
}

END { for (i in ccr) print ccr[i]",NOMATCH-QG" }

I've also condensed the GX pattern match to one regex. I get the same output as your example.
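The invocation is unchanged from the question, since this version still relies on -F"," to split the fields:

awk -F"," -v Pending=t -f a.awk file1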

– n0741337

In any program, I/O (e.g. print statements) is usually the most expensive operation in wall-clock terms. In awk there's an operation that's even slower, though, and that's string concatenation. Because awk doesn't require you to pre-allocate memory for strings, memory is allocated dynamically, and every time a string grows it must be reallocated. So you can speed up your program by removing the string concatenations, e.g. all those hard-coded ","s you're printing, and setting/using the OFS instead.
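If you want to gauge the effect in isolation, a rough micro-benchmark along these lines (the loop count and the /dev/null sink are arbitrary choices) compares the concatenating form against the OFS form:

time awk 'BEGIN{ a="x"; b="y"; for(i=0;i<1000000;i++) print a "," b > "/dev/null" }'
time awk 'BEGIN{ OFS=","; a="x"; b="y"; for(i=0;i<1000000;i++) print a, b > "/dev/null" }'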

I haven't really thought through the logic of your overall approach, but here are a couple of other tweaks you could try:

BEGIN{ FS=OFS="," }

NF {    # skip empty lines
    if ($3 ~ /name=GX[12]_TRANSACTION/) {
        if($1 in ccr) {
            print ccr[$1], "NOMATCH-QG"
        }
        ccr[$1]=$0
    }
    else {    # any other non-empty line is treated as a QG line
        if($1 in ccr) {
            print ccr[$1], $0
            delete ccr[$1]
        }
        else {
            print $0, "NOUSER" >> Pending
        }
    }
}

END {
    for (i in ccr)
        print ccr[i], "NOMATCH-QG"
} 

Note that by setting FS in the script you no longer need to use -F"," on the command line.
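With that change the command line from the question shrinks to:

awk -v Pending=t -f a.awk file1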

Are you sure you want >> instead of > on the print to "Pending"? Those 2 constructs don't mean the same in awk as they do in shell.
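For the record: within a single awk run, > truncates the output file only the first time it is opened and appends on every print after that, while >> never truncates. A quick illustration (out.txt is just a throwaway name):

awk 'BEGIN { print "one" > "out.txt"; print "two" > "out.txt" }'    # out.txt holds both lines
awk 'BEGIN { print "three" > "out.txt" }'                           # out.txt now holds only "three"
awk 'BEGIN { print "four" >> "out.txt" }'                           # "four" is appended after "three"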

– Ed Morton
  • I also like the string concatenation point you explained. Not sure how much it will help performance, but I'll implement it for sure. Thanks :) – Vipin Choudhary Dec 29 '13 at 14:41
  • And yes, I want to use >> because I want to append to a file. – Vipin Choudhary Dec 29 '13 at 14:41
  • Hello Ed, again. Interestingly, when I implemented your changes the processing time increased by 4 seconds (on a 290MB file). I tried this 3 times, and when I reverted the changes it decreased by 4 seconds again. So your changes are somehow increasing the time. – Vipin Choudhary Dec 29 '13 at 14:55
  • `>` and `>>` both "append to a file"; one just zaps the original file first if it existed. I can't imagine what would cause the time to increase; you could try implementing the changes incrementally and testing to see where the time changes, if you care. Just be aware that caching of results can have an impact, so try every solution a few times and average the time stats for each. – Ed Morton Dec 29 '13 at 15:15
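To average the timings as suggested, something like this would do (a sketch, assuming bash; the output is discarded so only the run time is measured):

for i in 1 2 3; do time awk -F"," -v Pending=t -f a.awk file1 > /dev/null; done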