Compute symmetric matrix of a large file with awk

Question

I have an OD matrix (origin-destination matrix) written in list form, like this inputfile.csv:

"origin_id","destination_id","trips"
"0","0","20"
"0","1","12"
"0","2","8"
"1","0","23"
"1","1","50"
"1","2","6"
"2","1","9"
"2","2","33"

Which reads as:

There was 20 trips from origin_id=0 to destination_id=0,
There was 12 trips from origin_id=0 to destination_id=1,
And so on.

All the origin-destination pairs that have 0 trips, are not present in the input file (the elements of the matrix with zeros).

I need to compute the symmetric matrix as S=(OD+DO)/2, but the main problem is that the inputfile.csv is 30GB in size. I thought that a tool like awk could be a good solution, but I don't know how to proceed. I think that the pseudo algorithm should be something like this:

Read line, keep origin1,destination2 and trips12 (where origin1,destination2 can be any origin_id or desitnation_id).
- If destination2,origin1 is present in the file, keep trips21 and write: origin1,destination2 --> (trips12+trips21)/2
- Else write: origin1,destination1: (trips12)/2

Same for every pair (origin-destination).

I think that awk can be great for this task, but I am open to use any suggested tool (python, perl, octave, etc...)

awk -F"\"" '{a[$2$4]==$6;if $4$2 ...}' inputfile.csv

No clue how to do it...

Desired output:

"origin_id","destination_id","trips"
"0","0","20"  
"0","1","17.5"
"0","2","4"
"1","1","50"
"1","2","7.5"
"2","2","33"

score 2 · Accepted Answer · answered Apr 05 '23 at 12:43

2

How much RAM do you have? Would this approach work?

awk 'BEGIN {
    FS = OFS = ","
}

NR == 1 {
    print
}

{
    gsub("\"", "", $3)
    a[$1 FS $2] = $3
    b[$2 FS $1] = $3
}

END {
    for (i in a) {
        if (i in b) {
            print i, "\"" (a[i] + b[i]) / 2 "\""
        }
    }
}' inputfile.csv
"origin_id","destination_id","trips"
"0","2","12.5"
"0","1","8.5"
"1","1","22"
"0","0","12"
"1","0","8.5"
"2","0","12.5"

answered Apr 05 '23 at 12:43

jared_mamrot

22,354
4
21
46

1

Thank for your answer! I have 32GB of RAM, for the moment I am testing your script in smaller-toy-matrixs. Your answer works when all the matrix elements are non-zero, if there are missing elements it does skip them. I modified a little bit your script – ElTitoFranki Apr 05 '23 at 13:43

ElTitoFranki · Answer 2 · 2023-04-05T14:16:43.177

1

Adapting the answer of @jared_mamrot to my needs, I am using this (note the else):

awk BEGIN {
    FS = OFS = ","
}

NR==1 { 
print; next 
}

{
    gsub("\"", "", $3)
    a[$1 FS $2] = $3
    b[$2 FS $1] = $3
}

END {
    for (i in a) {
        if (i in b) {
            print i, "\"" (a[i] + b[i]) / 2 "\""
        } 
        else {
            print i, "\"" (a[i]) / 2 "\""
        }
    }
}' inputfile.csv

edited Apr 05 '23 at 14:16

answered Apr 05 '23 at 13:45

ElTitoFranki

375
1
7

Compute symmetric matrix of a large file with awk

2 Answers2