
I have an OD matrix (origin-destination matrix) written in list form, like this inputfile.csv:

"origin_id","destination_id","trips"
"0","0","20"
"0","1","12"
"0","2","8"
"1","0","23"
"1","1","50"
"1","2","6"
"2","1","9"
"2","2","33"

Which reads as:

  • There were 20 trips from origin_id=0 to destination_id=0,
  • There were 12 trips from origin_id=0 to destination_id=1,
  • And so on.

Note that origin-destination pairs with 0 trips are not present in the input file (the zero elements of the matrix are omitted).

I need to compute the symmetric matrix as S=(OD+DO)/2, but the main problem is that inputfile.csv is 31GB in size. I thought that a tool like awk could be a good solution, but I don't know how to proceed: I have 32GB of RAM, which is close to the input file size, and I always run out of memory.

The desired output should ideally contain only the diagonal and subdiagonal elements, to avoid repeating values since the matrix is symmetric. The desired final output for the provided input file is:

"origin_id","destination_id","trips"
"0","0","20"  
"0","1","17.5"
"0","2","4"
"1","1","50"
"1","2","7.5"
"2","2","33"

As a complementary note, I can compute the symmetric matrix for smaller input files using this script (symm.awk):

BEGIN {
    FS = OFS = ","
}

NR==1 {
    print; next
}

{
    gsub("\"", "", $3)
    a[$1 FS $2] = $3
    b[$2 FS $1] = $3
}

END {
    for (i in a) {
        if (i in b) {
            print i, "\"" (a[i] + b[i]) / 2 "\""
        } 
        else {
            print i, "\"" (a[i]) / 2 "\""
        }
    }
}

and then keep only the diagonal and subdiagonal elements by piping the result into awk again, like this:

awk -f symm.awk inputfile.csv |awk -F"\"" 'NR==1{print;next}$2<=$4{print $0}' > output.csv

But the output isn't sorted, and it's also ugly to run one awk script and then pipe the result into awk again.

I would appreciate some help in avoiding this "piping awk output into awk", and it would also be great to have some hints on how to deal with such a large input file.

ElTitoFranki

2 Answers


One awk | sort idea; output the header first:

head -1 inputfile.csv

then generate and sort the body:

awk '
BEGIN { FS=OFS="," }
NR==1 { next }
      { gsub(/"/,"")
        if   ($1 <= $2) a[$1,$2]+=$3
        else            a[$2,$1]+=$3
      }
END   { for (i in a) {
            split(i,ndx,SUBSEP)
            printf "\"%s\",\"%s\",\"%s\"\n", ndx[1], ndx[2],
                    a[ndx[1],ndx[2]] / (ndx[1]==ndx[2] ? 1 : 2 )
        }
      }
' inputfile.csv | sort -t, -k1,1V -k2,2V

This generates:

"origin_id","destination_id","trips"
"0","0","20"
"0","1","17.5"
"0","2","4"
"1","1","50"
"1","2","7.5"
"2","2","33"
markp-fuso
  • I installed GNU awk and used your suggested script for GNU awk, it works very well, the only issue is that I get out of memory (since the input file size is approximately the same size of my RAM). Do you have any idea of how to solve this issue? – ElTitoFranki Apr 05 '23 at 17:27
  • yeah, that's why I deleted the `GNU awk` answer ... running out of memory here, too; for some reason the `a[x][y]` construct uses up 10x (or more) memory than `a[x,y]`; you could use this (mawk) script with `gawk`, too ... but I don't see an easy way to use `gawk's` `PROCINFO["sorted_in"]` to do sorting within awk ... though there's probably a way to build a custom sort function – markp-fuso Apr 05 '23 at 17:31

This pipeline should work with a file of any size, since awk holds only 2 lines in memory at a time (and sort spills to temporary files on disk):

awk -F, 'FNR>1{
      gsub(/"/,"")
      if ($1+0 <= $2+0)
            printf("%s-%s,%s,%s,%s\n", $1,$2,$1,$2,$3)
      else  printf("%s-%s,%s,%s,%s\n", $2,$1,$2,$1,$3)}' file.csv |
sort -t "," -k1,1 |
awk 'BEGIN{FS=OFS=","}

function pp(){
    split(lines[1],a,",")
    if (cnt==2) {
        split(lines[2],b,",")
        lines[1]=a[1] OFS a[2] OFS (a[3]+b[3])/2
    }
    else if (a[1]!=a[2]) {
        # off-diagonal pair with no reverse entry: the missing count is 0
        lines[1]=a[1] OFS a[2] OFS a[3]/2
    }
    gsub(/,/,"\",\"", lines[1])
    print "\"" lines[1] "\""
    split("",lines)
    cnt=0
}
NR==1{lines[++cnt]=$2 OFS $3 OFS $4; prev=$1; next}
$1!=prev{ pp() }
{
    lines[++cnt]=$2 OFS $3 OFS $4
    prev=$1
}
END{pp()}'

With the example input this prints:

"0","0","20"
"0","1","17.5"
"0","2","4"
"1","1","50"
"1","2","7.5"
"2","2","33"

You can add the header with head -n 1 file.csv as a separate step.

It works by doing a Decorate / Sort / Undecorate approach to group the (i,j) and (j,i) lines together. The decorate key joins the two IDs with a "-" separator; bare concatenation would make pairs like (1,23) and (12,3) collide on the same key 123. Both orientations are rewritten with the smaller ID first, and an off-diagonal pair that appears only once is halved, because the missing reverse entry means 0 trips.

Here is the Decorate / Sort step:

awk -F, 'FNR>1{
            gsub(/"/,"")
            if ($1+0 <= $2+0)
                  printf("%s-%s,%s,%s,%s\n", $1,$2,$1,$2,$3)
            else  printf("%s-%s,%s,%s,%s\n", $2,$1,$2,$1,$3)}' file.csv |
sort -t "," -k1,1

Prints:

0-0,0,0,20
0-1,0,1,12
0-1,0,1,23
0-2,0,2,8
1-1,1,1,50
1-2,1,2,6
1-2,1,2,9
2-2,2,2,33

The second awk then detects each group as a run of 1 or 2 lines with the same value in $1.
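That group detection is a classic awk control-break pattern; here is a stripped-down illustration on toy data (nothing from the question):

```shell
# Sum $2 within runs of identical $1 (control-break in awk).
printf '%s\n' a,1 a,2 b,3 |
awk -F, '
NR>1 && $1!=prev { print prev, sum; sum=0 }   # key changed: flush the group
                 { sum+=$2; prev=$1 }
END              { print prev, sum }          # flush the final group
'
# prints:
# a 3
# b 3
```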

dawg
  • Your script is not giving me the symmetric matrix (OD+DO)/2 depending on the sparse matrix that I give it as input. For example (sorry for the formating):`"origin_id","destination_id","total_trips" "100037","414","16.414176129169583" "100037","70523","11.529717422041143" "100037","70525","29.478318090277075" "100037","70526","18.30672178261923" "100039","11551","10.832549328242418" "100039","186683","13.278608853974628" "100039","414","10.14694524348668" "100039","70525","18.222960273989443" "100042","275984","11.952502263453564"` gives an output that is not the symmetrical matrix – ElTitoFranki Apr 06 '23 at 08:48