The question's title is misleading and based on a fundamental misconception about awk.
The naïve answer is that a space can simply be represented as itself (a literal) in regular expressions in awk.
More generally, you can use [[:space:]] to match any whitespace character, such as a space, a tab or a newline (GNU Awk also supports \s), and [[:blank:]] to match only a space or a tab.
However, the crux of the problem is that Awk, by default, splits each input line into fields by whitespace. By definition, then, no input field itself contains whitespace, so any attempt to match a space in a field value will invariably fail.
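To see the default splitting in action, here is a minimal sketch. The input line is made up for illustration and is not the OP's data. It counts the fields and checks whether any of them contains a literal space:

```shell
# Hypothetical input line: a bareword, a double-quoted string, another bareword.
line='foo "bar baz" qux'

# With the default field separator, awk splits on runs of whitespace,
# so the quoted string is broken into two separate fields.
out=$(printf '%s\n' "$line" | awk '{
  hits = 0
  for (i = 1; i <= NF; ++i) if ($i ~ / /) hits++   # does any field contain a space?
  printf "NF=%d hits=%d second=%s\n", NF, hits, $2
}')
printf '%s\n' "$out"   # prints: NF=4 hits=0 second="bar
```

The quoted string ends up as the two fields "bar and baz", and no field ever matches / /.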
The input at hand has fields that are a mix of unquoted and quoted strings, but POSIX Awk has no support for recognizing quoted strings as fields.
@fedorqui has made a valiant attempt to work around the problem by splitting input into fields by double quotes, but it's no substitute for proper recognition of quoted strings, because it doesn't preserve the true field boundaries.
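As a rough illustration of why splitting on double quotes loses the field boundaries, consider this sketch with a made-up input line (the sample text is an assumption):

```shell
# Hypothetical input: two barewords, a double-quoted string, one more bareword.
line='foo bar "baz qux" end'

# Use the double quote itself as the field separator: the quoted content
# becomes its own field, but the barewords around it are not split apart.
out=$(printf '%s\n' "$line" | awk -F'"' '{
  for (i = 1; i <= NF; ++i) printf "[%s]", $i
  print ""
}')
printf '%s\n' "$out"   # prints: [foo bar ][baz qux][ end]
```

The two barewords foo and bar land in a single field together with a trailing space, so the field numbers no longer correspond to the logical tokens of the line.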
If you have GNU Awk, you can approximate recognition of quoted strings using the special FPAT variable, which, rather than defining a separator to split lines by, allows defining a regex that describes fields (and ignores tokens not recognized as such):
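re='[[:alpha:]][[:alpha:] ]*[[:alpha:]]' # aux. shell variable: letters and spaces, with a letter at each end
gawk -v FPAT="\"$re\"|'$re'" '{
  # NF counts only the quoted strings matched by FPAT
  for (i = 1; i <= NF; ++i) printf "%s%s", $i, (i == NF ? "\n" : " ")
}' sample.txt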
This will work with single- and double-quoted strings.
Explanation:
FPAT="\"$re\"|'$re'" defines fields to be either double- or single-quoted strings consisting only of letters and spaces, with a letter at each end (as in the OP's code).
- Note that this automatically excludes the UNquoted tokens on each input line - they will not be reflected in $1, ... and NF.
- Therefore, the loop for(i=1;i<=NF;++i) is already limited to enumerating only the matching fields.
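The effect on NF can be checked with a short sketch that feeds one made-up line (an assumption, since sample.txt is not shown here) through the same FPAT definition. It assumes gawk is installed, because FPAT is specific to GNU Awk:

```shell
# Hypothetical input: barewords mixed with a double- and a single-quoted string.
line="id1 \"foo bar\" 42 'baz qux' end"
re='[[:alpha:]][[:alpha:] ]*[[:alpha:]]'   # letters and spaces, a letter at each end

# Only tokens matching FPAT become fields; everything else is skipped.
out=$(printf '%s\n' "$line" | gawk -v FPAT="\"$re\"|'$re'" '{
  printf "NF=%d:", NF
  for (i = 1; i <= NF; ++i) printf " <%s>", $i
  print ""
}')
printf '%s\n' "$out"   # prints: NF=2: <"foo bar"> <'baz qux'>
```

The barewords id1, 42 and end do not show up at all, and the enclosing quotes remain part of each field value.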
Note that the restrictions placed on the contents of the quoted strings in this case happen to sidestep a limitation inherent in this approach, namely the inability to deal with escaped nested quotes (of the same type).
If this limitation is acceptable, you can use the following idiom to tokenize input that is a mix of barewords (unquoted tokens) and quoted strings:
gawk -v "FPAT=[^[:blank:]]+|\"[^\"]*\"|'[^']*'" ...
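As a sketch of how this idiom tokenizes a line, the following runs it on a made-up input (an assumption) and prints each field in angle brackets. Again, gawk is assumed to be available:

```shell
# Hypothetical input: barewords mixed with a double- and a single-quoted string.
line="foo \"bar baz\" 'qux quux' end"

# A field is a run of non-blanks, a double-quoted string, or a single-quoted
# string; where alternatives overlap, the longest match is taken, so a quoted
# string containing blanks is kept whole.
out=$(printf '%s\n' "$line" | gawk -v "FPAT=[^[:blank:]]+|\"[^\"]*\"|'[^']*'" '{
  printf "NF=%d:", NF
  for (i = 1; i <= NF; ++i) printf " <%s>", $i
  print ""
}')
printf '%s\n' "$out"   # prints: NF=4: <foo> <"bar baz"> <'qux quux'> <end>
```

The quotes are still part of the field values. Stripping them, if needed, is a separate step.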