1

I have a large .csv file to to process and my elements are arranged randomly like this:

xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012 xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012 xxxxxx,xx,MLOCAL,341993,22/10/2012
xxxxxx,xx,MREMOTE,9356828,08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012

where fields LOCAL, REMOTE, MLOCAL or MREMOTE are displayed like:

  1. when they are displayed as pairs (LOCAL/REMOTE) if 3rd field is MLOCAL, and 4th field is MREMOTE, then 5th and 7th field represent the value and date of MLOCAL, and 6th and 8th represent the value and date of MREMOTE
  2. when they are displayed as single (only LOCAL or only REMOTE) then the 4th and 5th fields represent the value and date of field 3.

Now, I have split these rows using:

nawk 'BEGIN{

while (getline < "'"$filedata"'")
split($0,ft,",");
name=ft[1];
ID=ft[2]
 ?=ft[3]
 ?=ft[4]
....................

but because I can't find a pattern for the 3rd and 4th field I'm pretty stuck to continue to assign var names for each of the array elements in order to use them for further processing.

Now, I tried to use "case" statement but isn't working for awk or nawk (only in gawk is working as expected). I also tried this:

if ( ft[3] == "MLOCAL" && ft[4]!= "MREMOTE" )
{
        MLOCAL=ft[3];
        MLOCAL_qty=ft[4];
        MLOCAL_TIMESTAMP=ft[5];
}
else if ( ft[3] == MLOCAL && ft[4] == MREMOTE )
{
        MLOCAL=ft[3];
        MREMOTE=ft[4];
        MOCAL_qty=ft[5];
        MREMOTE_qty=ft[6];
        MOCAL_TIMESTAMP=ft[7];
        MREMOTE_TIMESTAMP=ft[8];
}
else if ( ft[3] == MREMOTE && ft[4] != MOCAL )
{
        MREMOTE=ft[3];
        MREMOTE_qty=ft[4];
        MREMOTE_TIMESTAMP=ft[5];
..........................................

but it's not working as well.

So, if you have any idea how to handle this, I would be grateful to give me a hint in order to be able to find a pattern in order to cover all the possible situations from above.

EDIT

I don't know how to thank you for all this help. Now, what I have to do is more complex than I wrote above, I'll try to describe as simple as I can otherwise I'll make you guys pretty confused. My output should be like following:

NAME,UNIQUE_ID,VOLUME_ALOCATED,MLOCAL_VALUE,MLOCAL_TIMESTMP,MLOCAL_limit,LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_limit,MREMOTE_VALUE,MREMOTE_TIMESTAMP,REMOTE_VALUE,REMOTE_TIMESTAMP

(where MLOCAL_limit and LOCAL_limit are a subtract result between VOLUME_ALOCATED and MLOCAL_VALUE or LOCAL_VALUE)

So, in my output file, fields position should be arranged like: 4th field =MLOCAL_VALUE,5th field =MLOCAL_TIMESTMP,7th field=LOCAL_VALUE, 8th field=LOCAL_TIMESTAMP,10th field=MREMOTE_VALUE,11th field=MREMOTE_TIMESTAMP,12th field=REMOTE_VALUE,13th field=REMOTE_TIMESTAMP

Now, an example would be this: for the following input: name,ID,VOLUME_ALLOCATED,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012

name,ID,VOLUME_ALLOCATED,REMOTE,234455,19/12/2012

I should process this line and the output should be this:

name,ID,VOLUME_ALLOCATED,33222,22/10/2012,MLOCAL_LIMIT, ,,,56,18/10/2012,,

7th ,8th, 9th,12th, and 13th fields are empty because there is no info related to: LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_limit,REMOTE_VALUE, and REMOTE_TIMESTAMP

OR

name,ID,VOLUME_ALLOCATED,,,,,,,,,234455,9/12/2012

4th,5th,6th,7th,8th,9th,10thand ,11th, fields should be empty values because there is no info about: MLOCAL_VALUE,MLOCAL_TIMESTAMP,MLOCAL_LIMIT,LOCAL_VALUE,LOCAL_TIMESTAMP,LOCAL_LIMIT,MREMOTE_VALUE,MREMOTE_TIMESTAMP

VOLUME_ALLOCATED is retrieved from other csv file (called "info.csv") based on the ID field which is processed earlier in the script like:

info.csv

VOLUME_ALLOCATED,ID,CLIENT 5242881,64,subscriber 567743,24,visitor

data.csv

NAME,64,MLOCAL,341993,23/10/2012 NAME,24,LOCAL$REMOTE,2347$4324,19/12/2012$18/12/2012

Now, my code is this:

    #! /usr/bin/bash

input="info.csv"
filedata="data.csv"
outfile="out"

nawk 'BEGIN{
while (getline < "'"$input"'")
{
split($0,ft,",");
volume=ft[1];
id=ft[2];
client=ft[3];

key=id;
volumeArr[key]=volume;
clientArr[key]=client;
}
close("'"$input"'");

while (getline < "'"$filedata"'")
{
gsub(/\$/,","); # substitute the $ separator with comma
split($0,ft,",");
volume=volumeArr[id]; # Get the volume from the volumeArr, using "id" as key
segment=clientArr[id]; # Get the client mode from the clientArr, using "id" as key
NAME=ft[1];
id=ft[2];

here I'm stuck, I can't find the right way to set the rest of the fields since I don't know how to handle the 3rd and 4th fields.

? =ft[3];
? =ft[4];

Sorry, if I make you pretty confused but this is my current situation right now. Thanks

Asgard
  • 602
  • 1
  • 7
  • 18
  • Divide your development, and just process 1 type of record until that works, then add an `else if` for the next one. As you seem to know, you want to process the more complex tests first. But why are you creating arrs? `awk` is designed to proc field numbers just as you describe in your text. Although your 2nd example looks pretty close to your problem description. To simplify your code, something like `awk -F, '{if ( $3=="MLOCL" && $4 == "MRE" ) { ...} else if($3=="MLOCL") {print $3 "\t" $4 "\t" $5}}'` even if you have to pipe that result to another `awk` program to create your final output. – shellter Jan 18 '13 at 05:38
  • 1
    dont jump out of awk to access the value of a shell variable as that's clumsy and error prone, and don't use getline like that as it's unnecessary and the wrong way to call getline. – Ed Morton Jan 18 '13 at 06:03
  • 1
    Your second code fragment has some occurrences of MREMOTE and MLOCAL (when it is not MOCAL!) in quotes and some not in quotes. (a) Your spelling must be consistent, and (b) you need the names in quotes uniformly. – Jonathan Leffler Jan 18 '13 at 06:29
  • 1
    Where does `VOLUME_ALLOCATED` come from? Are MLOCAL and LOCAL 2 different things or are you just sometimes dropping the M? Ditto for MREMOTE and REMOTE. – Ed Morton Jan 18 '13 at 14:40
  • I just edited my post to show the complete input. YES, LOCAL and MLOCAL are 2 different things. LOCAL refers to volume consumed by VISITOR client and MLOCAL is refering only for the SUBSCRIBER CLIENT. SAME for REMOTE and MREMOTE. REMOTE -> VISITOR CLIENT connecting remote, MREMOTE -> SUBSCRIBER CLIENT connecting REMOTE – Asgard Jan 18 '13 at 15:28
  • SO, guys, have any idea how should I solve this? Thanks – Asgard Jan 21 '13 at 07:37

1 Answers1

7

You didn't provide the expected output from your sample input but here's a start to show how to get the values for the 2 different formats of input line:

$ cat tst.awk
BEGIN{ FS=","; OFS="\t" }
{
   delete value       # or use split("",value) if your awk cant delete arrays
   if ($4 ~ /LOCAL|REMOTE/) {
      value[$3] = $5
      date[$3]  = $7
      value[$4] = $6
      date[$4]  = $8
   }
   else {
      value[$3] = $4
      date[$3]  = $5
   }

   print
   for (type in value) {
      printf "%15s%15s%15s\n", type, value[type], date[type]
   }
}
$ awk -f tst.awk file
xxxxxx,xx,MLOCAL,MREMOTE,33222,56,22/10/2012,18/10/2012
        MREMOTE             56     18/10/2012
         MLOCAL          33222     22/10/2012
xxxxxx,xx,MREMOTE,MLOCAL,33222,56,22/10/2012,18/10/2012
        MREMOTE          33222     22/10/2012
         MLOCAL             56     18/10/2012
xxxxxx,xx,MLOCAL,*341993,22/10/2012*
         MLOCAL        *341993    22/10/2012*
xxxxxx,xx,MREMOTE,9356828,08/10/2012
        MREMOTE        9356828     08/10/2012
xxxxxx,xx,LOCAL,REMOTE,19316,15253,22/10/2012,22/10/2012
         REMOTE          15253     22/10/2012
          LOCAL          19316     22/10/2012
xxxxxx,xx,REMOTE,LOCAL,1865871,383666,22/10/2012,22/10/2012
         REMOTE        1865871     22/10/2012
          LOCAL         383666     22/10/2012
xxxxxx,xx,REMOTE,1180306134,19/10/2012
         REMOTE     1180306134     19/10/2012

and if you post the expected output we could help you more.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • **+1** ... Very succinct. Though from the look of the data, I would probably have used `if ($4 ~ /^M/)` instead of searching for `/LOCAL|REMOTE/`. – ghoti Jan 18 '13 at 12:46
  • 1
    @ghoti - that would fail when the 4th field is LOCAL as in the 6th line of the posted sample input. I could have used /[[:upper:]]/ but I'm really not sure if the OPs telling us everything about his input data or if the posted sample is even really representative. – Ed Morton Jan 18 '13 at 14:35
  • I just edited my post so that you have complete information of my input files – Asgard Jan 18 '13 at 15:27