I have an OD matrix (origin-destination matrix) written in list form, like this inputfile.csv
:
"origin_id","destination_id","trips"
"0","0","20"
"0","1","12"
"0","2","8"
"1","0","23"
"1","1","50"
"1","2","6"
"2","1","9"
"2","2","33"
Which reads as:
- There was 20 trips from origin_id=0 to destination_id=0,
- There was 12 trips from origin_id=0 to destination_id=1,
- And so on.
All the origin-destination pairs that have 0 trips, are not present in the input file (the elements of the matrix with zeros).
I need to compute the symmetric matrix as S=(OD+DO)/2
, but the main problem is that the inputfile.csv
is 30GB in size. I thought that a tool like awk could be a good solution, but I don't know how to proceed. I think that the pseudo algorithm should be something like this:
- Read line, keep
origin1,destination2
andtrips12
(whereorigin1,destination2
can be anyorigin_id
ordesitnation_id
).- If
destination2,origin1
is present in the file, keeptrips21
and write:origin1,destination2 --> (trips12+trips21)/2
- Else write:
origin1,destination1: (trips12)/2
- If
- Same for every pair (origin-destination).
I think that awk can be great for this task, but I am open to use any suggested tool (python, perl, octave, etc...)
awk -F"\"" '{a[$2$4]==$6;if $4$2 ...}' inputfile.csv
No clue how to do it...
Desired output:
"origin_id","destination_id","trips"
"0","0","20"
"0","1","17.5"
"0","2","4"
"1","1","50"
"1","2","7.5"
"2","2","33"