22

Awk offers associative indexing for arrays. The elements of a one-dimensional array can be iterated over, e.g.:

for(idx in arr1)   # "index" is a built-in function name, so it cannot be used as a variable
  print "arr1[" idx "]=" arr1[idx]

But how is this done for a two-dimensional array? Does the kind of syntax given below work?

for(index1 in arr2)
for(index2 in arr2)
   arr2[index1,index2]     
cobp
  • 1
    `gawk` as of v4 supports arrays as elements i.e. nested arrays, more flexible than multidimensional arrays, `for (i in arr2) for (j in arr2[i]) print arr2[i][j]`, see [JJoao's answer](http://stackoverflow.com/a/35891319/1290731) – jthill Mar 06 '17 at 21:29

5 Answers

43

AWK fakes multidimensional arrays by concatenating the indices with the character held in the SUBSEP variable (0x1c). You can iterate through a two-dimensional array using split like this (based on an example in the info gawk file):

awk 'BEGIN { OFS=","; array[1,2]=3; array[2,3]=5; array[3,4]=8; 
  for (comb in array) {split(comb,sep,SUBSEP);
    print sep[1], sep[2], array[sep[1],sep[2]]}}'

Output (the traversal order of for-in is unspecified, so the lines may appear in a different order):

2,3,5
3,4,8
1,2,3

You can, however, iterate over a numerically indexed array using nested for loops:

for (i = 1; i <= width; i++)
    for (j = 1; j <= height; j++)
        print array[i, j]

Another noteworthy bit of information from the GAWK manual:

To test whether a particular index sequence exists in a multidimensional array, use the same operator (in) that is used for single dimensional arrays. Write the whole sequence of indices in parentheses, separated by commas, as the left operand:

     (subscript1, subscript2, ...) in array
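
As a minimal sketch of that membership test (the array name grid and its values are made up for illustration, and the shell variable out only captures the output), the following contrasts the in operator with a plain reference, which in awk creates the element as a side effect:

```sh
out=$(awk 'BEGIN {
    grid[1,2] = 3
    grid[2,3] = 5

    # (i, j) in grid tests for the pair without creating it
    if ((1, 2) in grid) print "(1,2) present"
    if (!((2, 2) in grid)) print "(2,2) absent"

    # Merely referencing grid[3,3] creates an empty element
    if (grid[3,3] == "") print "(3,3) looked empty"
    if ((3, 3) in grid) print "(3,3) now exists"
}')
printf '%s\n' "$out"
```

This matters for the nested numeric loops shown earlier: printing array[i, j] for a pair that was never assigned quietly adds that pair to the array, so guarding with (i, j) in array is the safer choice when the grid may be sparse.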

Gawk 4 adds arrays of arrays. From that link:

for (i in array) {
    if (isarray(array[i])) {
        for (j in array[i]) {
            print array[i][j]
        }
    }
    else
        print array[i]
}

Also see Traversing Arrays of Arrays for information about the following function which walks an arbitrarily dimensioned array of arrays, including jagged ones:

function walk_array(arr, name,      i)
{
    for (i in arr) {
        if (isarray(arr[i]))
            walk_array(arr[i], (name "[" i "]"))
        else
            printf("%s[%s] = %s\n", name, i, arr[i])
    }
} 
Dennis Williamson
6

No, the syntax

for(index1 in arr2) for(index2 in arr2) {
    print arr2[index1][index2];
}

won't work. Awk doesn't truly support multi-dimensional arrays. What it does, if you do something like

x[1,2] = 5;

is to concatenate the two indexes (1 & 2) to make a string, separated by the value of the SUBSEP variable. If this is equal to "*", then you'd have the same effect as

x["1*2"] = 5;
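
To make that concrete, here is a small sketch that sets SUBSEP to "*" and prints the key awk actually stores. It uses only standard awk features; the shell variable out is there just to capture the output.

```sh
out=$(awk 'BEGIN {
    SUBSEP = "*"          # set before any x[i,j] subscript is evaluated
    x[1,2] = 5

    # The only key in x is the concatenated string "1*2"
    for (key in x) print "stored key: " key

    # Indexing with that string reaches the same element
    print "x[\"1*2\"] holds: " x["1*2"]
}')
printf '%s\n' "$out"
```

A printable separator like this is only safe when it can never occur inside an index value, which is presumably why the default is a control character.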

The default value of SUBSEP is a non-printing character, corresponding to Ctrl+\. You can see this with the following script:

BEGIN {
    x[1,2]=5;
    x[2,4]=7;
    for (ix in x) {
        print ix;
    }
}

Running this gives:

% awk -f scriptfile | cat -v
1^\2
2^\4

So, in answer to your question - how to iterate a multi-dimensional array - just use a single for(a in b) loop, but you may need some extra work to split up a into its x and y parts.
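
One way to do that extra work, sketched here under the assumption that nested-loop style traversal is what is wanted: split each combined key once, remember the distinct first and second parts, then loop over those and skip pairs that were never stored. The helper arrays rows, cols and part are names invented for this example, and the output is piped through sort only because for-in order is unspecified.

```sh
out=$(awk 'BEGIN {
    x[1,2] = 5; x[1,7] = 6; x[2,4] = 7

    # Pass 1: split every combined key and record each distinct part
    for (key in x) {
        split(key, part, SUBSEP)
        rows[part[1]] = 1
        cols[part[2]] = 1
    }

    # Pass 2: nested loops over the recorded parts, skipping absent pairs
    for (r in rows)
        for (c in cols)
            if ((r, c) in x)
                print r, c, x[r,c]
}' | sort)
printf '%s\n' "$out"
```

This should run in any POSIX awk, at the cost of visiting every row/column combination, so for very sparse data the single loop with split is likely cheaper.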

psmears
4

I'll provide an example of how I use this in my work processing query data. Suppose you have an extract file full of transactions by product category and customer id:

customer_id  category  sales
1111         parts     100.01
1212         parts       5.20
2211         screws      1.33
...etc...

It's easy to use awk to count total distinct customers with a purchase:

awk 'NR>1 {a[$1]++} END {for (i in a) total++; print "customers: " total}' \
datafile.txt

However, computing the number of distinct customers with a purchase in each category suggests a two dimensional array:

awk 'NR>1 {a[$2,$1]++} 
      END {for (i in a) {split(i,arr,SUBSEP); custs[arr[1]]++}
           for (k in custs) printf "category: %s customers:%d\n", k, custs[k]}' \
datafile.txt

The increment of custs[arr[1]]++ works because each category/customer_id pair is unique as an index to the associative array used by awk.
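
A possible single-pass variant, offered as a sketch rather than as the method used above: count a customer for a category only the first time that category/customer_id pair is seen, so no split is needed in the END block. The array name seen is invented here, and the sample input reuses the rows shown earlier plus one made-up repeat purchase (1111, parts, 7.00) to show that duplicates are not counted twice.

```sh
out=$(printf '%s\n' \
    'customer_id  category  sales' \
    '1111         parts     100.01' \
    '1212         parts       5.20' \
    '1111         parts       7.00' \
    '2211         screws      1.33' |
awk 'NR > 1 && !seen[$2,$1]++ { custs[$2]++ }   # first sighting of this pair
     END { for (k in custs) printf "category: %s customers:%d\n", k, custs[k] }' |
sort)
printf '%s\n' "$out"
```

The pattern relies on seen[$2,$1]++ evaluating to 0 (false) the first time a pair appears and to a positive number afterwards.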

In truth, I use GNU awk, which is faster and can do array[i][j], as D. Williamson mentioned. But I wanted to be sure I could do this in standard awk.

Merlin
3

Current versions of gawk (GNU awk, the default on many Linux distributions and installable almost anywhere) have real multidimensional arrays, implemented as arrays of arrays.

for(b in a)
   for(c in a[b])
      print a[b][c], c , b

See also the isarray() function.

JJoao
1

awk(1) was originally designed, in part, to be a teaching tool for the C language, and multi-dimensional arrays have been in both C and awk(1) pretty much forever. As such, POSIX IEEE 1003.2 standardized them.

To explore the syntax and semantics, if you create the following file called "test.awk":

BEGIN {
  KEY["a"]="a";
  KEY["b"]="b";
  KEY["c"]="c";
  MULTI["a"]["test_a"]="date a";
  MULTI["b"]["test_b"]="dbte b";
  MULTI["c"]["test_c"]="dcte c";
}
END {
  for(k in KEY) {
    kk="test_" k ;
    print MULTI[k][kk]
  }
  for(q in MULTI) {
    print q
  }
  for(p in MULTI) {
    for( pp in MULTI[p] ) {
      print MULTI[p][pp]
    }
  }
}

and run it with this command:

awk -f test.awk /dev/null

you will get the following output:

date a
dbte b
dcte c
a
b
c
date a
dbte b
dcte c

at least on Linux Mint 18 Cinnamon 64-bit 4.4.0-21-generic #37-Ubuntu SMP

Bob Makowski
  • 2
    You are using nonstandard GNU extensions in your tests. These do not work in [mawk](https://invisible-island.net/mawk/), which explicitly "conforms to the Posix 1003.2 (draft 11.3) definition of the AWK language" (this refers to the second part of [POSIX before 1997](https://en.wikipedia.org/wiki/POSIX#Parts_before_1997) and is confusingly obsoleted by [IEEE Std 1003.1-2017, aka POSIX.1-2017](http://pubs.opengroup.org/onlinepubs/9699919799/)). The [current POSIX spec for awk](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html) still lacks references to your syntax. – Adam Katz Jan 30 '19 at 20:54
  • Considering I helped write the awk(1) standard for POSIX IEEE 1003.2, I'm happy to point to that work and rely on it. As per the above documentation, it works on the awk that came installed on Linux Mint 18 Cinnamon 64-bit 4.4.0-21-generic #37-Ubuntu SMP. – Bob Makowski Mar 04 '19 at 15:59
  • 1
    Thanks for your work on the spec! Could you link to the spec you wrote and point to where it mandates multi-dimensional array support? I couldn't find it. (Also, why is it missing from POSIX.1-2017?) Which implementation of awk comes with Mint 18 (`ls -l /etc/alternatives/awk`) and at what version? For me, `gawk 'BEGIN { a[2]["c"] = 4 }'` works fine (gawk 4.2.1) but `mawk 'BEGIN { a[2]["c"] = 4 }'` gives me a syntax error (mawk 1.3.3). – Adam Katz Mar 04 '19 at 19:45
  • First, the original question was a "how to solve the problem" and not a "what does the standard say". There's no dispute that the above code actually works on the platform I cited in my documentation. – Bob Makowski Jul 07 '19 at 21:27
  • Secondly, interesting that the link you provided actually contradicts your claim that multidimensional arrays are in that version of the standard. Because it specifically shows the following syntax: var[expr1, expr2, ... exprn]. I found this true as far back as the Open Group IEEE joint specification "Single UNIX Specification version 3, POSIX:2001", which incorporates Issue 6 of their spec. – Bob Makowski Jul 07 '19 at 21:40
  • It's interesting to note that I found nothing in the gawk documentation, e.g. their man pages, where GNU specify what version of POSIX they comply with. As for "drafts", I have about a dozen in my garage. Because early on, there was a dispute as to whether "nawk(1)" syntax from UNIX System V Release 3.1 would be admitted into the standard. Brian Kernighan himself emailed a congratulations because I was able to persuade the group to accept nawk(1) as the standard. At that time, I was an officer of IEEE 1003.2, and the subcommittee chair directly responsible for all the commands A-M. – Bob Makowski Jul 07 '19 at 21:44
  • 1
    The question is tagged `awk` and not `gawk`. This answer does not discuss the POSIX multi-dimensional array approximation, which uses commas to indicate `SUBSEP` delimiters for a second dimension. That format is a bit unwieldy since it's so hard to tease out and it can't facilitate a third dimension. More importantly, as I noted, `mawk` won't accept `a[2]["c"]` since it's not in any spec (beyond the `gawk` man/info pages). Many systems use the fully POSIX-compliant `mawk` or `nawk` (rather than `gawk`) as `/usr/bin/awk`. A `gawk`-only answer won't work for users of such systems. – Adam Katz Jul 08 '19 at 14:32
  • Ah, it would help if you figured out that on Linux, when you do "man awk" you get a manpage for gawk. Secondly, you beg the entire issue that the POSIX standard that you actually cited has the exact syntax you claimed did not exist in the standard. 3rd you are begging the question that my solution actually works. You are wasting folks' time with this. – Bob Makowski Jul 08 '19 at 14:39
  • 1
    `man awk` is tied to the manual for whatever you have installed as your `awk` binary; `gawk` for you and `mawk` for me. There are two instances of `][` in the [latest awk spec](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html) and neither refer to true multidimensional arrays like `a[2]["c"]`. All I see are lines like “Because awk arrays are really one-dimensional, such a -separated list shall be converted to a single string by concatenating the string values of the separate expressions, each separated from the other by the value of the SUBSEP variable.” – Adam Katz Jul 09 '19 at 13:55
  • @BobMakowski as an author of the awk spec, you know surprisingly little about the syntax included there (it's been years though, everybody can forget things). The array syntax `a[2, "c"]` *is* in POSIX standard (for ages) and all awk implementations support it, while the syntax `a[2]["c"]` you are using is a GNU awk extension that was introduced in `gawk` version 4.0. You happen to have `gawk` on your Linux Mint system, but folks with other awks will need a different solution (most likely the one by Merlin, as he's the only one who gave non-`gawk` answer). – jena Feb 06 '23 at 12:14
  • Sorry, also the accepted answer has some non-`gawk` solution. But what does it matter, right? Gawk is easy to get everywhere, why bother? I even have a `gawk` [script](https://github.com/janxkoci/PopGen.awk/blob/main/scripts/vcfGTcount.gawk) that does what I need. But my data files are 200GB and `gawk` takes days. On the other hand, `mawk` is often *much* faster. So I care about code that can run with it, sometimes. – jena Feb 06 '23 at 22:06
  • @jena ad hominem attacks are no substitute for reality and are usually a pretext for false statements about your opponent. Bottom line: there's nothing wrong with the example, and nothing wrong with the cover note I provided. – Bob Makowski Feb 17 '23 at 20:25
  • awk's multi-dimensional array syntax is from C, and was inherited by C++, Java and groovy. awk's innovation to allow ascii as index has been there since the awk programming book was originally published. – Bob Makowski Feb 17 '23 at 20:33
  • @BobMakowski Sorry, I didn't mean to offend, I really was genuinely surprised (english is my second tongue). Anyway, you still seem to be confused about what we say here. Nobody disputes that _associative arrays_ are in POSIX awk - they are and were since the beginning. What we dispute is the `gawk`-specific syntax for _arrays of arrays_, which uses not one pair, but multiple pairs of square brackets. This syntax only works with `gawk`, which you happen to have on your LM. But some users need a portable solution that works with other awks and not just with `gawk`. I hope it's clearer now. – jena Feb 23 '23 at 13:32
  • @BobMakowski to help illustrate the problem, you can run your program like this: `gawk --posix -f test.awk /dev/null`. It _will_ fail. – jena Feb 23 '23 at 15:51
  • @BobMakowski @jena : `gawk`'s so called `--posix` mode is beyond horrific - not only do you have to manually loop array indices just to measure the array length, and its inconsistency in hex decoding, namely, `gawk -P` is willing to decode a string hex of `"0xABC"` but will fail once you also turn on `GMP bigint` `gawk -MP`, recently made worse by ….. – RARE Kpop Manifesto Aug 02 '23 at 16:11
  • …. having this code that only does string comparison trigger an annoying error message : `printf $'\271' | gawk -P '{ print ($0 < "\333\271") }' gawk: cmd. line:1: (FILENAME=- FNR=1) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale` even though I didn't use any function like `length($0)` or `index($0, str1 )` (these would definitely trigger that message, even without `-P`, whenever input isn't well formed `UTF-8`) – RARE Kpop Manifesto Aug 02 '23 at 16:13
  • @RAREKpopManifesto That's beyond the point. The code in the answer will still only run under `gawk` and fail under any other version. That was my point. – jena Aug 03 '23 at 17:30
  • @jena : my point was that using `gawk -P` instead of another `awk` like `mawk` is a problematic frame of reference. – RARE Kpop Manifesto Aug 04 '23 at 10:41
  • yes, it will fail in all awks other than `gawk`, because it's using [a gawk-specific extension](https://www.gnu.org/software/gawk/manual/html_node/Arrays-of-Arrays.html) (see first sentence there). How are people still not getting it? – jena Aug 14 '23 at 11:34
  • @RAREKpopManifesto BTW in your code you compare strings (as in checking alphabetical order), which requires _collation order_. Since collation order is locale-dependent, gawk warns you about it, see [here](https://www.gnu.org/software/gawk/manual/html_node/POSIX-String-Comparison.html). Also, to my knowledge, gawk is the only version that actually supports multibyte data like UTF-8, while all other awks were built for ASCII only. But I don't really deal with multibyte data, so I may be wrong on this one (I'm 100% sure about the arrays though, as I use those _a ton_). – jena Aug 14 '23 at 11:50
  • @RAREKpopManifesto Also, when you say: "gawk's so called --posix mode is beyond horrific - not only do you have to manually loop array indices just to measure the array length" - this is from POSIX standard. Both gawk and mawk have an extension that lets you pass an array as argument to `length()`, while POSIX-compliant awks will throw an error. So blame POSIX, not gawk here ;) See the [full list of gawk extensions](https://www.gnu.org/software/gawk/manual/html_node/POSIX_002fGNU.html) for more. – jena Aug 14 '23 at 12:00
  • @jena : it's an illusion there's `POSIX-compliant awk` - `gawk` `mawk` `nawk` all fail at least some part of the specs - for starters - `nawk`s computes inaccurate list of characters for regex classes, `mawk`s have a half baked implementation `regex` replication interval, and `gawk` makes successful regex matches that has no relation to any known flavor of `regex` – RARE Kpop Manifesto Aug 17 '23 at 16:54
  • @jena : but even `gawk`, in typical `UTF-8` mode, has plenty of issues on its own - like ::::::: `echo 1234 | gawk '/(*|**)+/'` being a successful `regex` match even though no known `regex` flavor I'm aware of would match this input to this pattern (and these are already plain `ASCII` bytes we're talking) – RARE Kpop Manifesto Aug 17 '23 at 17:25
  • @jena : as for that "collation order" thing you're speaking of, `gawk` is very inconsistent, cuz running it as `gawk -e` `gawk -c` `gawk -t` `gawk -O` `gawk -S` `gawk -r` `gawk -M` `gawk -s` doesn't trigger the warning. Even better, it's still inconsistent behavior within `gawk -P` ::::: `{ print ("=" < "\371") }'` is a warning message but `{ print ("=" ~ "\371"), ("=" !~ "\371"), ("\371" ~ "="), ("\371" !~ "=") }'` etc regexes asking it to either match or match-against an explicit byte returns correct answer with no warnings, but only the `> >= <= <` ones would whine. – RARE Kpop Manifesto Aug 17 '23 at 17:39
  • @jena : if collation order were make or break then how come none of the other modes broke ? – RARE Kpop Manifesto Aug 17 '23 at 17:43
  • "if collation order were make or break then how come none of the other modes broke ?" - because gawk [explicitly](https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html) has this behaviour only in POSIX mode, while keeping sensible traditional behaviour in other modes. About your other examples, I mostly don't get them. – jena Aug 23 '23 at 14:49
  • btw I get a match with `echo 1234 | egrep '\(*|**\)+'` (sure, `grep` needs to escape the brackets, but otherwise it's the same, isn't it?).. – jena Aug 23 '23 at 14:53