23

I'm working on a shell script that will be used by others, and may ingest suspect strings. It's based around awk, so as a basic resiliency measure, I want to have awk output null-terminated strings - the commands that will receive data from awk can thus avoid a certain amount of breakage from strings that contain spaces or not-often-found-in-English characters.

Unfortunately, from the basic awk documentation, I'm not getting how to tell awk to print a string terminated by an ASCII null instead of by a newline. How can I tell awk that I want null-terminated strings?


Versions of awk that might be used:

[user@server1]$ awk --version
awk version 20070501

[user@server2]$ awk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

[user@server3]$ awk -W version
GNU Awk 3.1.7

So pretty much the whole family of awk versions. If we have to consolidate on a version, it'll probably be GNU Awk, but answers for all versions are welcome since I might have to make it work across all of these awks. Oh, legacy scripts.

Brighid McDonnell

4 Answers

22

There are three alternatives:

  1. Setting ORS to ASCII zero: other solutions use awk -vORS=$'\0', but $'\0' is a construct specific to some shells (bash, zsh), so that command will not work in most older shells.
     ORS can also be set inside the awk program: awk 'BEGIN { ORS = "\0" } ; { print $0 }', but that does not work in most awk versions (see the results below).

  2. Printing (printf) with the character \0: awk '{ printf "%s\0", $0 }'

  3. Printing ASCII 0 directly: awk '{ printf( "%s%c", $0, 0 )}'

Testing all alternatives with this code:

#!/bin/bash

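# Run the awk program given in $1 under each awk implementation
# and dump the resulting bytes with od.
test1(){
    a='awk,mawk,original-awk,busybox awk'
    IFS=',' read -ra line <<<"$a"
    for i in "${line[@]}"; do
        # $i is left unquoted below so that "busybox awk" splits into two words
        printf "%14.12s %40s" "$i" "$1"
        printf 'a\nb\nc\n' |
        $i "$1" |
        od -cAn
    done
}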

#test1 '{print}'
test1 'BEGIN { ORS = "\0" } ; { print $0 }'
test1 '{ printf "%s\0", $0}'
test1 '{ printf( "%s%c", $0, 0 )}'

We get these results:

            awk      BEGIN { ORS = "\0" } ; { print $0 }   a  \0   b  \0   c  \0
           mawk      BEGIN { ORS = "\0" } ; { print $0 }   a   b   c
   original-awk      BEGIN { ORS = "\0" } ; { print $0 }   a   b   c
    busybox awk      BEGIN { ORS = "\0" } ; { print $0 }   a   b   c
            awk                     { printf "%s\0", $0}   a  \0   b  \0   c  \0
           mawk                     { printf "%s\0", $0}   a   b   c
   original-awk                     { printf "%s\0", $0}   a   b   c
    busybox awk                     { printf "%s\0", $0}   a   b   c
            awk               { printf( "%s%c", $0, 0 )}   a  \0   b  \0   c  \0
           mawk               { printf( "%s%c", $0, 0 )}   a  \0   b  \0   c  \0
   original-awk               { printf( "%s%c", $0, 0 )}   a  \0   b  \0   c  \0
    busybox awk               { printf( "%s%c", $0, 0 )}   a   b   c

As can be seen above, the first two solutions work only in GNU Awk.

The most portable is the third solution: '{ printf( "%s%c", $0, 0 )}'.

None of the solutions works correctly in "busybox awk".
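
For what it's worth, here is a minimal sketch of how the third form might feed a NUL-aware consumer. The function name nul_lines and the sample input are made up for illustration, and xargs -0 is assumed to be available (it is common, but not required by POSIX).

```shell
#!/bin/sh
# Sketch only: nul_lines is a hypothetical helper name, not part of any tool.
# It terminates every input line with the byte produced by %c for the number 0.
nul_lines() {
    awk '{ printf("%s%c", $0, 0) }'
}

# Hand each record to printf as exactly one argument, embedded spaces included.
# xargs -0 is assumed to exist on the target system.
printf 'one two\nthree\n' | nul_lines | xargs -0 -n 1 printf '[%s]\n'
```

On the awks where the %c form produced \0 in the table above, each input line should reach the consumer as a single argument. On busybox awk the NULs are missing, so everything would arrive as one item.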

The versions used for these tests were:

          awk> GNU Awk 4.0.1
         mawk> mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
 original-awk> awk version 20110810
      busybox> BusyBox v1.20.2 (Debian 1:1.20.0-7) multi-call binary.
  • 1
    Many blessings on you for specifying the versions that you used! The problem that inspired this question has long since become Not Mine, but it does my heart good to see people leaving helpful, diligent answers. Well done. – Brighid McDonnell Nov 23 '15 at 22:11
  • Thank you, the %c option was just what I was looking for. It's perfect that it doesn't depend on the current shell's escaping magic. – Javier C Jul 02 '19 at 15:41
21

Alright, I've got it.

awk '{printf "%s\0", $0}'

Or, using ORS,

awk -vORS=$'\0' //
Kevin
  • 1
    When I pipe the results of those incantations into `xargs -0`, it doesn't split on the `\0` that awk is inserting (tested by splitting on something else). :( – Brighid McDonnell Feb 03 '12 at 19:55
  • @SeanM The first seems not to work, but the second is working for me, are you quite sure the problem is in `awk`? (try saving the output from just that to a file) – Kevin Feb 03 '12 at 20:06
  • That didn't work on all three platforms, but it led me to figure out that I could do what I wanted with Perl - which is what always seems to happen when I want to do anything remotely complex with awk or sed. Since your answer worked at least part of the time and put me on a path to a solution, I'm accepting it. :) – Brighid McDonnell Feb 03 '12 at 20:39
  • 1
    You can check awk's actual output by piping to `od -cAn`. I found that gawk would output the NUL bytes, but BusyBox awk and nawk on FreeBSD wouldn't. The sandrotosi.blogspot.com technique of `printf "%c",""` didn't work on those implementations either. – dubiousjim Apr 19 '12 at 03:29
  • 2
    I had to use double-quotes for the `-vORS` argument `awk -vORS=$"\0"`. This was with gawk 4.0.1. – Christian Long Mar 27 '15 at 18:44
  • 2
    `-v` isn't supported by BSD awk, e.g. the one in OSX. Inserting `\0` into a string doesn't work in it either; it's treated as the end of the string instead. – ivan_pozdeev Jun 29 '20 at 11:21
7

You can also pipe your awk's output through tr:

awk '{...code...}' infile | tr '\n' '\0' > outfile

Just tested, it works at least on Linux and FreeBSD.
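
A small sketch of that pipeline, in case it helps. The awk program ({ print $2 }), the function name second_field_nul, and the sample input are placeholders for illustration, not anything from the original script.

```shell
#!/bin/sh
# Sketch only: awk keeps its default newline ORS, and tr turns every
# newline into a NUL afterwards. This is only safe when no output
# record contains a newline of its own.
second_field_nul() {
    awk '{ print $2 }' | tr '\n' '\0'
}

# Show the bytes that come out, so the NULs are visible.
printf 'a first\nb second\n' | second_field_nul | od -cAn
```

Since awk itself never handles a NUL here, this does not depend on how a given awk treats "\0" or %c.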

If you cannot use newlines as separators (for example, if output records can contain newlines inside), just use some other character that's guaranteed not to appear inside a record, e.g. the one with code 1:

awk 'BEGIN { ORS="\001" } {...code...}' | tr '\001' '\0'
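
As a rough illustration of the multi-line case, assuming records are blank-line-separated paragraphs (RS = "" is awk's paragraph mode) and that the byte \001 never occurs in the data. The function name records_to_nul and the sample input are made up.

```shell
#!/bin/sh
# Sketch only: each paragraph is one record and may contain newlines,
# so \001 is used as a temporary terminator and swapped for NUL by tr.
# Assumption: the input never contains the byte \001.
records_to_nul() {
    awk 'BEGIN { RS = ""; ORS = "\001" } { print }' | tr '\001' '\0'
}

# Count the NUL terminators: one per record.
printf 'line 1\nline 2\n\nline 3\n' | records_to_nul | tr -cd '\000' | wc -c
```

The newline inside the first record survives, because only the temporary \001 terminator is translated.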
Macaronio
  • 1
    From what I've seen, this is the most portable and reliable answer. `tr '\n' '\0'` even works in busybox (unlike any use of null characters in busybox's `awk`). Rather than using `\001` (Start of Heading), I recommend `\036` (U+001e, Information Separator Two, a.k.a. Record Separator, RS) since the information separators are made for this purpose. (#2/RS maps to lines (awk's default ORS) while #1, Unit Separator, would be akin to awk's FS.) More at https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text – Adam Katz Mar 15 '19 at 14:55
  • Since UNIX paths can contain any bytes except `\0`, you are not doing it right if you use anything else, even if you replace it with `\0` afterwards: any inline bytes with the same code would be replaced, too. – ivan_pozdeev Jun 29 '20 at 16:00
-1

I've solved printing ASCII 0 from awk by running the UNIX command printf "\000" through awk's system():

echo | awk -v s='printf "\000"' '{system(s);}'
potame
suzhor