The problem I have is related to a previous post. What I have right now is:
One 9-million-row CSV file containing fields arranged like this:
192.168.12.23,62,LOCAL,341993,23/10/2012
192.168.12.25,11,MONLOCAL$MONREMOTE,33222$56,22/10/2012$18/10/2012
192.168.12.678,14,MONLOCAL,341993,22/10/2012
192.168.12.83,18, , ,
192.168.12.21,49,LOCAL$REMOTE,19316$15253,22/10/2012$22/10/2012
192.168.12.79,52,REMOTE,1180306134,19/10/2012
192.168.12.41,44,MONLOCAL$MONREMOTE,1865871$383666,22/10/2012$22/10/2012
192.168.12.29,23,MONREMOTE,151145,18/10/2012
Where, as you probably already observed, there are two field separators: the comma (,) and the dollar sign ($). The fields are as follows:
field 1 = IP address
field 2 = a unique ID
field 3 = the connection type that IP has used (e.g. LOCAL, REMOTE, MONLOCAL, or MONREMOTE)
field 4 = either another connection type the IP has used (LOCAL, REMOTE, MONLOCAL, or MONREMOTE) or the value related to field 3
field 5 = if field 4 is LOCAL, REMOTE, MONLOCAL, or MONREMOTE, then field 5 is the value of field 3; otherwise it is the timestamp related to field 3
field 6 = if field 4 is LOCAL, REMOTE, MONLOCAL, or MONREMOTE, then field 6 is the value of field 4; otherwise it does not exist
field 7 = if field 4 is LOCAL, REMOTE, MONLOCAL, or MONREMOTE, then field 7 is the timestamp of field 3; otherwise it does not exist
field 8 = if field 4 is LOCAL, REMOTE, MONLOCAL, or MONREMOTE, then field 8 is the timestamp of field 4; otherwise it does not exist
(A small splitting sketch just below shows how a record maps onto these positions.)
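To make that concrete, here is a minimal nawk sketch (not my processing script, and the file name sample.csv is only a placeholder) that splits each record on both separators and prints the numbered fields:

nawk 'BEGIN { FS = "[,$]" }                  # treat both "," and "$" as field separators
{
    printf "record %d:\n", NR
    for (i = 1; i <= NF; i++)                # print every field with its position
        printf "  field %d = [%s]\n", i, $i
}' sample.csv

For the second sample record above, for instance, this prints MONLOCAL as field 3, MONREMOTE as field 4, their values 33222 and 56 as fields 5 and 6, and their timestamps as fields 7 and 8.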
I need to process the file above and produce output in the following format, with the fields always in the following order (10 fields):
IP,ID,MONLOCAL_value,MONLOCAL_timestamp,LOCAL_value,LOCAL_timestamp,MONREMOTE_value,MONREMOTE_timestamp,REMOTE_value,REMOTE_timestamp, like this:
192.168.12.23,62, , ,341993,23/10/2012, , , ,
192.168.12.25,11,33222,22/10/2012, , , , ,56,18/10/2012
192.168.12.678,14,341993,22/10/2012, , , , , ,
192.168.12.83,18, , , , , , , ,
192.168.12.21,49, , ,19316,22/10/2012, , ,15253,22/10/2012
192.168.12.79,52, , , , , , ,1180306134,19/10/2012
192.168.12.41,44,1865871,22/10/2012, , ,383666,22/10/2012, ,
192.168.12.29,23, , , , ,151145,18/10/2012, ,
I have the following script to process the file shown above:
nawk 'BEGIN {
    while (getline < "'"$data"'" > 0) {
        FS = "[,$]"; OFS = ","
        split($0, flds)
        if ($4 ~ /LOCAL|REMOTE|MONLOCAL|MONREMOTE/) {
            if ($3 ~ /MONLOCAL/) {
                MONLOCAL_value = flds[5]; MONLOCAL_time = flds[7]
                MONREMOTE_value = flds[6]; MONREMOTE_time = flds[8]
            }
            if ($3 ~ /MONREMOTE/) {
                MONREMOTE_value = flds[5]; MONREMOTE_time = flds[7]
                REMOTE_value = flds[6]; REMOTE_time = flds[8]
            }
            if ($3 ~ /REMOTE/) {
                REMOTE_value = flds[5]; REMOTE_time = flds[7]
                LOCAL_value = flds[6]; LOCAL_time = flds[8]
            }
        } else {
            if ($3 ~ /MONLOCAL/)  { MONLOCAL_value = flds[4];  MONLOCAL_time = flds[5] }
            if ($3 ~ /MONREMOTE/) { MONREMOTE_value = flds[4]; MONREMOTE_time = flds[5] }
            if ($3 ~ /LOCAL/)     { LOCAL_value = flds[4];     LOCAL_time = flds[5] }
            if ($3 ~ /REMOTE/)    { REMOTE_value = flds[4];    REMOTE_time = flds[5] }
        }
        print flds[1], flds[2], MONLOCAL_value, MONLOCAL_time, LOCAL_value, LOCAL_time,
              MONREMOTE_value, MONREMOTE_time, REMOTE_value, REMOTE_time
    }
}'
The bad part here is that, as you can observe below, the output is not as expected, since I am unable to clear the stored values (the *_value and *_time variables) after each line is read, or to find a way to populate them dynamically for each line:
4915779000211,62, , ,341993,23/10/2012, , , , ,
4915779000212,11,33222,22/10/2012,341993,23/10/2012,56,18/10/2012, ,
491639000591,14,341993,22/10/2012,341993,23/10/2012, , , ,
491779001768,18,319307448,18/10/2012,19316,22/10/2012,383666,22/10/2012,1180306134,19/10/2012
4915779000213,49,3849259,05/10/2012,19316,22/10/2012,56,18/10/2012,15253,22/10/2012
491779000758,52,9356828,08/10/2012,19316,22/10/2012,56,18/10/2012,1180306134,19/10/2012
4915779000214,44,1865871,22/10/2012,19316,22/10/2012,383666,22/10/2012,1180306134,19/10/2012
491639000221,23,319307448,18/10/2012,19316,22/10/2012,151145,18/10/2012,1180306134,19/10/2012
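Just to illustrate what I mean by clearing the values, the kind of per-line reset I think is needed would look roughly like the sketch below (it only shows the reset and the final print, uses the variable names from my script, reads the file as normal awk input instead of through getline, and leaves the mapping logic out):

nawk -F '[,$]' -v OFS=',' '
{
    # blank everything set for the previous record so nothing leaks forward
    MONLOCAL_value = MONLOCAL_time = ""
    LOCAL_value = LOCAL_time = ""
    MONREMOTE_value = MONREMOTE_time = ""
    REMOTE_value = REMOTE_time = ""

    # ... the mapping of $3/$4 into the variables would go here ...

    print $1, $2, MONLOCAL_value, MONLOCAL_time, LOCAL_value, LOCAL_time,
          MONREMOTE_value, MONREMOTE_time, REMOTE_value, REMOTE_time
}' "$data"

I could not work out how to combine this kind of reset with the getline loop and the field mapping in my script, which is why the values above leak from one line into the next.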
So, if you guys have an idea of how to make this awk script work and produce the expected output, I would be grateful.