I have an OD matrix (origin-destination matrix) written in list form, like this inputfile.csv
:
"origin_id","destination_id","trips"
"0","0","20"
"0","1","12"
"0","2","8"
"1","0","23"
"1","1","50"
"1","2","6"
"2","1","9"
"2","2","33"
Which reads as:
- There was 20 trips from origin_id=0 to destination_id=0,
- There was 12 trips from origin_id=0 to destination_id=1,
- And so on.
Please note that all the origin-destination pairs that have 0 trips, are not present in the input file (the elements of the matrix with zeros).
I need to compute the symmetric matrix as S=(OD+DO)/2
, but the main problem is that the inputfile.csv
is 31GB in size. I thought that a tool like awk could be a good solution, but I don't know how to proceed, because I have 32GB of RAM which is very similar to the input file size, and I always get out of memory.
The desired output should contain ideally only the diagonal and sub-diagonal elements, to avoid repeating values as the matrix is symmetrical. The desired final output of the provided input file is:
"origin_id","destination_id","trips"
"0","0","20"
"0","1","17.5"
"0","2","4"
"1","1","50"
"1","2","7.5"
"2","2","33"
As complementary notes, I would like to add here that I can compute the symmetrical matrix with smaller input files using this script (symm.awk
):
BEGIN {
FS = OFS = ","
}
NR==1 {
print; next
}
{
gsub("\"", "", $3)
a[$1 FS $2] = $3
b[$2 FS $1] = $3
}
END {
for (i in a) {
if (i in b) {
print i, "\"" (a[i] + b[i]) / 2 "\""
}
else {
print i, "\"" (a[i]) / 2 "\""
}
}
}
and then get the diagonal and subdiagonal elements piping the result again to awk, like this:
awk -f symm.awk inputfile.csv |awk -F"\"" 'NR==1{print;next}$2<=$4{print $0}' > output.csv
But the ouptut it's not sorted, and also it's very ugly to obtain a result using an awk
script and then piping the result again to awk.
I would appreciate some help in avoiding this "piping awk result into awk", and also it would be great to have some clues on how to deal with such a large input file size.