I'm working on an Awk/Gawk script that parses a file and populates a multidimensional array from each line. The first column is a period-delimited string in which each segment is the array key for the next level down. The second column is the value.

Here's an example of what the content being parsed looks like:

$ echo -e "personal.name.first\t= John\npersonal.name.last\t= Doe\npersonal.other.dob\t= 05/07/87\npersonal.contact.phone\t= 602123456\npersonal.contact.email\t= john.doe@idk\nemployment.jobs.1\t= Company One\nemployment.jobs.2\t= Company Two\nemployment.jobs.3\t= Company Three"
personal.name.first     = John
personal.name.last      = Doe
personal.other.dob      = 05/07/87
personal.contact.phone  = 602123456
personal.contact.email  = john.doe@idk
employment.jobs.1       = Company One
employment.jobs.2       = Company Two
employment.jobs.3       = Company Three

After parsing, I'm expecting the result to have the same structure as:

data["personal"]["name"]["first"]     = "John"
data["personal"]["name"]["last"]      = "Doe"
data["personal"]["other"]["dob"]      = "05/07/87"
data["personal"]["contact"]["phone"]  = "602123456"
data["personal"]["contact"]["email"]  = "john.doe@idk"
data["employment"]["jobs"]["1"]       = "Company One"
data["employment"]["jobs"]["2"]       = "Company Two"
data["employment"]["jobs"]["3"]       = "Company Three"

The part that I'm stuck on is how to dynamically populate the keys while structuring the multidimensional array.

I found this SO thread covering a similar issue, which was resolved by using the SUBSEP variable. At first that seemed like it would do what I need, but after some testing it looks like arr["foo", "bar"] = "baz" isn't treated as a real nested array the way arr["foo"]["bar"] = "baz" would be. For example, I can't count the values at a given level of the array: arr["foo", "bar"] = "baz"; print length(arr["foo"]) simply prints 0 (zero).
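To illustrate why the count comes out as zero: with the comma syntax, the subscripts are joined with SUBSEP into one flat string key, so there is no arr["foo"] element at all. A minimal sketch, assuming any POSIX awk; count_under is a hypothetical helper name and the array contents are made up for the example:

```shell
# Hypothetical helper: count the keys stored under a given first-level
# subscript in a flat, SUBSEP-joined array. No gawk extensions are used.
count_under() {
    awk -v want="$1" 'BEGIN {
        arr["foo", "bar"] = "baz"      # really arr["foo" SUBSEP "bar"]
        arr["foo", "qux"] = "quux"
        arr["other", "x"] = "y"
        n = 0
        for (key in arr) {
            split(key, part, SUBSEP)   # recover the individual subscripts
            if (part[1] == want)
                n++
        }
        print n
    }'
}

count_under foo      # prints 2
count_under other    # prints 1
```

So a per-level count is still possible with flat keys, but it has to be done by scanning and splitting the keys rather than by calling length() on a sub-level.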

I found this SO thread which helps a little, possibly pointing me in the right direction.

This snippet from that thread:

BEGIN {
  x=SUBSEP

  a="Red" x "Green" x "Blue"
  b="Yellow" x "Cyan" x "Purple"

  Colors[1][0] = ""
  Colors[2][0] = ""

  split(a, Colors[1], x)
  split(b, Colors[2], x)

  print Colors[2][3]
}

This is pretty close, but the problem I'm having now is that the keys (e.g. Red, Green, etc.) need to be specified dynamically, and there could be one or more of them.

Basically, how can I take the a_keys and b_keys strings, split them on ".", and populate the a and b variables as multidimensional arrays?

BEGIN {
  x=SUBSEP

  # How can I take these strings...
  a_keys = "Red.Green.Blue"
  b_keys = "Yellow.Cyan.Purple"

  # .. And populate the array, just as this does:
  a="Red" x "Green" x "Blue"
  b="Yellow" x "Cyan" x "Purple"

  Colors[1][0] = ""
  Colors[2][0] = ""

  split(a, Colors[1], x)
  split(b, Colors[2], x)

  print Colors[2][3]
}

Any help would be appreciated, thanks!

Justin

2 Answers

All you need is:

BEGIN { FS="\t= " }
{
    split($1,d,/\./)
    data[d[1]][d[2]][d[3]] = $2
}

Look:

$ cat tst.awk
BEGIN { FS="\t= " }
{
    split($1,d,/\./)
    data[d[1]][d[2]][d[3]] = $2
}
END {
    for (x in data)
        for (y in data[x])
            for (z in data[x][y])
                print x, y, z, "->", data[x][y][z]
}

$ awk -f tst.awk file
personal other dob -> 05/07/87
personal name first -> John
personal name last -> Doe
personal contact email -> john.doe@idk
personal contact phone -> 602123456
employment jobs 1 -> Company One
employment jobs 2 -> Company Two
employment jobs 3 -> Company Three

The above is gawk-specific, of course (it needs gawk 4.0 or later), since no other awk supports true multidimensional arrays.
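With true arrays of arrays, the per-level count asked about in the question works directly, because length() accepts a subarray. A minimal sketch, assuming gawk 4.0 or later is installed as gawk; sample and nested_counts are hypothetical helper names, and the input is a shortened copy of the question's data:

```shell
# Hypothetical helper: emit a few lines of the question's sample input
# (key, then tab, "=", space, then the value).
sample() {
    printf 'personal.name.first\t= John\n'
    printf 'personal.name.last\t= Doe\n'
    printf 'personal.other.dob\t= 05/07/87\n'
    printf 'personal.contact.phone\t= 602123456\n'
    printf 'employment.jobs.1\t= Company One\n'
    printf 'employment.jobs.2\t= Company Two\n'
    printf 'employment.jobs.3\t= Company Three\n'
}

# Populate data[][][] as in the answer, then count entries at several levels.
nested_counts() {
    sample | gawk '
    BEGIN { FS = "\t= " }
    {
        split($1, d, /\./)
        data[d[1]][d[2]][d[3]] = $2
    }
    END {
        print length(data["personal"])           # name, other, contact
        print length(data["personal"]["name"])   # first, last
        print length(data["employment"]["jobs"]) # 1, 2, 3
    }'
}

nested_counts   # prints 3, 2 and 3 on separate lines
```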

To populate a multi-dimensional array when the indices aren't always of the same depth (e.g. 3 above), it's rather more complicated:

##########
$ cat tst.awk
function rec_populate(a,idxs,curDepth,maxDepth,tmpIdxSet) {
    if ( tmpIdxSet ) {
        delete a[SUBSEP]                # delete scalar a[]
        tmpIdxSet = 0
    }
    if (curDepth < maxDepth) {
        # We need to ensure a[][] exists before calling populate() otherwise
        # inside populate() a[] would be a scalar, but then we need to delete
        # a[][] inside populate() before trying to create a[][][] because
        # creating a[][] below creates IT as scalar. SUBSEP used arbitrarily.

        if ( !( (idxs[curDepth] in a) && (SUBSEP in a[idxs[curDepth]]) ) ) {
            a[idxs[curDepth]][SUBSEP]   # create array a[] + scalar a[][]
            tmpIdxSet = 1
        }
        rec_populate(a[idxs[curDepth]],idxs,curDepth+1,maxDepth,tmpIdxSet)
    }
    else {
        a[idxs[curDepth]] = $2
    }
}

function populate(arr,str,sep,  idxs) {
    split(str,idxs,sep)
    rec_populate(arr,idxs,1,length(idxs),0)
}

{ populate(arr,$1,",") }

END { walk_array(arr, "arr") }

function walk_array(arr, name,      i)
{
    # Mostly copied from the following URL, just added setting of "sorted_in":
    #   https://www.gnu.org/software/gawk/manual/html_node/Walking-Arrays.html
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (i in arr) {
        if (isarray(arr[i]))
            walk_array(arr[i], (name "[" i "]"))
        else
            printf("%s[%s] = %s\n", name, i, arr[i])
    }
}

##########
$ cat file
a uno
b,c dos
d,e,f tres_wan
d,e,g tres_twa
d,e,h,i,j cinco

##########
$ awk -f tst.awk file
arr[a] = uno
arr[b][c] = dos
arr[d][e][f] = tres_wan
arr[d][e][g] = tres_twa
arr[d][e][h][i][j] = cinco
Ed Morton
  • I may have forgotten to mention that the keys may not always be exactly three segments.. But this is a good enough start for me to work with. Thanks! – Justin May 15 '17 at 16:45
  • Then you'll need the second solution I posted. Stating the obvious - if your real data has multiple ranks of segments then your sample input should have too. – Ed Morton May 15 '17 at 17:22

Without real multidimensional arrays, you can do a little more bookkeeping:

awk -F'\t= ' '{split($1,k,"."); 
               k1[k[1]]; k2[k[2]]; k3[k[3]]; 
               v[k[1],k[2],k[3]]=$2}
          END {for(i1 in k1) 
                 for(i2 in k2)
                   for(i3 in k3) 
                     if((i1,i2,i3) in v) 
                       print i1,i2,i3," -> ",v[i1,i2,i3]}' file


personal other dob  ->  05/07/87
personal name first  ->  John
personal name last  ->  Doe
personal contact email  ->  john.doe@idk
personal contact phone  ->  602123456
employment jobs 1  ->  Company One
employment jobs 2  ->  Company Two
employment jobs 3  ->  Company Three
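The same bookkeeping idea could be extended to keys of any depth, and to counting the children at each level, by recording every parent/child pair once. This is only a sketch under assumptions: it should run in any POSIX awk, sample and child_count are hypothetical helper names, and the input is a shortened copy of the question's data.

```shell
# Hypothetical helper: emit a few lines of the question's sample input.
sample() {
    printf 'personal.name.first\t= John\n'
    printf 'personal.name.last\t= Doe\n'
    printf 'personal.other.dob\t= 05/07/87\n'
    printf 'employment.jobs.1\t= Company One\n'
    printf 'employment.jobs.2\t= Company Two\n'
    printf 'employment.jobs.3\t= Company Three\n'
}

# Hypothetical helper: print how many direct children a dotted prefix has.
child_count() {
    sample | awk -v want="$1" '
    BEGIN { FS = "\t= " }
    {
        val[$1] = $2                   # flat store keyed by the full path
        n = split($1, k, /\./)
        path = ""
        for (i = 1; i <= n; i++) {
            child = (path == "" ? k[i] : path "." k[i])
            if (!((path, child) in seen)) {   # first time we meet this node
                seen[path, child]
                kids[path]++                  # one more child under "path"
            }
            path = child
        }
    }
    END { print kids[want] + 0 }'
}

child_count personal          # prints 2
child_count employment.jobs   # prints 3
```

Values are still looked up by their full dotted key (for example val["personal.name.first"]), so this trades nested indexing for a flat store plus a separate child counter.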
karakfa