
This problem is related to a previous post. What I have right now is:

  1. One 9-million-row CSV file whose fields are arranged in varying layouts, like:

    192.168.12.23,62,LOCAL,341993,23/10/2012
    192.168.12.25,11,MONLOCAL$MONREMOTE,33222$56,22/10/2012$18/10/2012
    192.168.12.678,14,MONLOCAL,341993,22/10/2012
    
    192.168.12.83,18, , ,
    192.168.12.21,49,LOCAL$REMOTE,19316$15253,22/10/2012$22/10/2012
    192.168.12.79,52,REMOTE,1180306134,19/10/2012
    192.168.12.41,44,MONLOCAL$MONREMOTE,1865871$383666,22/10/2012$22/10/2012
    192.168.12.29,23,MONREMOTE,151145,18/10/2012
    

    As you have probably already observed, there are two field separators, `,` and `$`:

    • field 1 = IP address

    • field 2 = a unique ID

    • field 3 = the connection the IP has used (e.g. LOCAL, REMOTE, MONLOCAL, or MONREMOTE)

    • field 4 = either another connection the IP has used (LOCAL, REMOTE, MONLOCAL, or MONREMOTE) or the value related to field 3

    • field 5 = if field 4 is LOCAL, REMOTE, MONLOCAL, or MONREMOTE, then field 5 represents the value of field 3; otherwise it represents the timestamp related to field 3

    • field 6 = if field 4 is LOCAL, REMOTE, MONLOCAL, or MONREMOTE, then field 6 represents the value of field 4; otherwise it does not exist

    • field 7 = if field 4 is LOCAL, REMOTE, MONLOCAL, or MONREMOTE, then field 7 represents the timestamp of field 3; otherwise it does not exist

    • field 8 = if field 4 is LOCAL, REMOTE, MONLOCAL, or MONREMOTE, then field 8 represents the timestamp of field 4; otherwise it does not exist
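    As a quick sanity check on this layout, awk can take a regular-expression field separator so that `,` and `$` split in a single pass (a hypothetical one-liner, separate from the script in point 3; it only prints the field count and field 3 of each row):

```shell
# Split each row on either "," or "$" and show how many fields result:
# single-connection rows yield 5 fields, "$"-joined rows yield 8.
printf '%s\n' \
  '192.168.12.23,62,LOCAL,341993,23/10/2012' \
  '192.168.12.25,11,MONLOCAL$MONREMOTE,33222$56,22/10/2012$18/10/2012' |
awk -F'[,$]' '{print NF": "$3}'
```

    Rows with 8 fields carry two connection types (and two values and two timestamps), which is exactly the case the script in point 3 has to branch on.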

  2. I need to process the file above and output the following format, with the fields always in the following order (10 fields):

    IP,ID,MONLOCAL_value,MONLOCAL_timestamp,LOCAL_value,LOCAL_timestamp,MONREMOTE_value,MONREMOTE_timestamp,REMOTE_value,REMOTE_timestamp, like:

    192.168.12.23,62, , ,341993,23/10/2012, , , , 
    192.168.12.25,11,33222,22/10/2012, , , , ,56,18/10/2012
    192.168.12.678,14,341993,22/10/2012, , , , , ,  
    192.168.12.83,18, , , , , , , , 
    192.168.12.21,49, , ,19316,22/10/2012, , ,15253,22/10/2012
    192.168.12.79,52, , , , , , ,1180306134,19/10/2012
    192.168.12.41,44,1865871,22/10/2012, , ,383666,22/10/2012, , 
    192.168.12.29,23, , , , ,151145,18/10/2012, , 
    
  3. I have the following script to process the file above:

    nawk 'BEGIN {
        while (getline < "'"$data"'" > 0)
        {
        {FS = "[,,$]"; OFS=","}
        split($0,flds)
           {if ($4 ~ /LOCAL|REMOTE|MONLOCAL|MONREMOTE/) {
                if ($3 ~ /MONLOCAL/) {
                        MONREMOTE_time=flds[8];
                        MONREMOTE_value=flds[6];
                        MONLOCAL_time=flds[7];
                        MONLOCAL_value=flds[5]; }
                if ($3 ~ /MONREMOTE/) {
                        MONREMOTE_time=flds[7];
                        MONREMOTE_value=flds[5];
                        REMOTE_value=flds[6];
                        REMOTE_time=flds[8]; }
                if ($3 ~ /REMOTE/) {
                        REMOTE_value=flds[5];
                        REMOTE_time=flds[7];
                        LOCAL_value=flds[6];
                        LOCAL_time=flds[8]; }
        } else {
                if($3 ~ /MONLOCAL/) {
                        MONLOCAL_value=flds[4];
                        MONLOCAL_time=flds[5]; }
                if ($3 ~ /MONREMOTE/) {
                        MONREMOTE_value=flds[4];
                        MONREMOTE_time=flds[5]; }
                if ($3 ~ /LOCAL/) {
                        LOCAL_value=flds[4];
                        LOCAL_time=flds[5]; }
                if ($3 ~ /REMOTE/) {
                        REMOTE_value=flds[4];
                        REMOTE_time=flds[5]; }
        }
      }
      {print $1,$2,MONLOCAL_value,MONLOCAL_time,LOCAL_value,LOCAL_time,MONREMOTE_value,MONREMOTE_time,REMOTE_value,REMOTE_time;}
     }
    }'
    
  4. The bad part is that, as you can observe below, the output is not as expected, since I'm unable to clear the array values after each line is read, or to find a way to reload the array elements dynamically for each line:

    4915779000211,62, , ,341993,23/10/2012, , , , ,
    4915779000212,11,33222,22/10/2012,341993,23/10/2012,56,18/10/2012, , 
    491639000591,14,341993,22/10/2012,341993,23/10/2012, , , ,  
    491779001768,18,319307448,18/10/2012,19316,22/10/2012,383666,22/10/2012,1180306134,19/10/2012
    4915779000213,49,3849259,05/10/2012,19316,22/10/2012,56,18/10/2012,15253,22/10/2012
    491779000758,52,9356828,08/10/2012,19316,22/10/2012,56,18/10/2012,1180306134,19/10/2012
    4915779000214,44,1865871,22/10/2012,19316,22/10/2012,383666,22/10/2012,1180306134,19/10/2012
    491639000221,23,319307448,18/10/2012,19316,22/10/2012,151145,18/10/2012,1180306134,19/10/2012
    

So, if you have an idea of how to make this awk script work and give the expected output, I would be grateful.

Asgard
    It's not obvious why you're doing this in a loop inside the BEGIN block. Why not use std implied loop of awk like `awk '{ if (logic) { ...} else if (more logic) { .... } else { .... } }' file >outfile` Also to reset an array, you should be able to `delete flds` and then reassign key/values as needed. Good luck. – shellter Jan 23 '13 at 05:27
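shellter's suggestion can be sketched as follows (an illustrative reduction, not the original script; `vals` and `n` are assumed names): the state array is cleared with `delete` at the top of every record, so values can no longer leak from one line into the next, and the field positions follow the layout in point 1.

```shell
# Use awk's implicit per-record loop; clear the per-line state array
# with "delete" before handling each record so nothing carries over.
printf '%s\n' \
  '192.168.12.23,62,LOCAL,341993,23/10/2012' \
  '192.168.12.25,11,MONLOCAL$MONREMOTE,33222$56,22/10/2012$18/10/2012' \
  '192.168.12.41,44,MONLOCAL$MONREMOTE,1865871$383666,22/10/2012$22/10/2012' |
awk -F'[,$]' -v OFS=',' '
{
    delete vals                   # reset per-line state (portably: split("", vals))
    n = (NF == 8) ? 2 : 1         # 8 fields means two connection types on this line
    for (i = 1; i <= n; i++) {
        type = $(2 + i)           # MONLOCAL, LOCAL, MONREMOTE or REMOTE
        vals[type "_value"] = $(2 + n + i)
        vals[type "_time"]  = $(2 + 2 * n + i)
    }
    print $1, $2, vals["MONLOCAL_value"], vals["MONLOCAL_time"], vals["LOCAL_value"], vals["LOCAL_time"], vals["MONREMOTE_value"], vals["MONREMOTE_time"], vals["REMOTE_value"], vals["REMOTE_time"]
}'
```

Unset array entries print as empty strings, so missing slots come out as `,,` rather than the single spaces shown in the sample output in point 2. `delete vals` on a whole array is widely supported (gawk, nawk, mawk); `split("", vals)` is the strictly portable spelling.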

1 Answer


With some modification you can use something similar to the following:

while read line; do
  if [[ "$line" == *\$* ]]; then
    echo "$line" | awk -F',|\$' -v OFS=',' '{print $1,$2,$5,$6,$3}' >> newfile
  else
    echo "$line" | awk -F',' -v OFS=',' '{print $1,$2,$3,"",$4,"",$5}' >> newfile
  fi
done < "/path/to/your/file"
Chris Montanaro
  • I already tried that, but I have 11 million lines to parse in a few hours (~ 500 lines/sec); it took 6 hours to process ~ 100 000 lines, which is not good enough. I need to stay inside the nawk script in order to keep the processing speed high. Anyway, thanks for the advice – Asgard Jan 30 '13 at 12:01
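Given the throughput concern above, the answer's two-branch logic can also run inside a single awk process, removing the per-line `echo | awk` fork that dominates the shell loop's runtime. This sketch keeps the answer's field selection unchanged and feeds two sample rows inline; a real run would read the data file instead:

```shell
# Same branch logic as the answer, but in one awk process:
# no shell loop and no per-line fork, so it scales to millions of rows.
printf '%s\n' \
  '192.168.12.23,62,LOCAL,341993,23/10/2012' \
  '192.168.12.25,11,MONLOCAL$MONREMOTE,33222$56,22/10/2012$18/10/2012' |
awk -F'[,$]' -v OFS=',' '
NF == 8 { print $1, $2, $5, $6, $3; next }    # rows joined with "$"
        { print $1, $2, $3, "", $4, "", $5 }  # plain rows
'
```

Branching on `NF` replaces the `[[ "$line" == *\$* ]]` test: with the combined separator `[,$]`, the `$`-joined rows are exactly the ones that split into 8 fields.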