Here's the deal: I need to read a specific number of bytes, which will be processed later on. I've encountered a strange phenomenon, though, and I can't wrap my head around it. Maybe someone else can? :)
NOTE: The following code-examples are slimmed-down versions just to show the effect!
A way of doing this, at least with gawk, is to set RS to a catch-all regex and then use RT to see what has been matched:
RS="[\x00-\xFF]"
Then, quite simply use the following awk-script:
BEGIN {
    ORS = ""            # no separator after each print
    OFS = ""
    RS  = "[\x00-\xFF]" # every single byte terminates a record
}
{
    print RT            # RT holds the byte that matched RS
}
This works fine:
$ echo "abcdef" | awk -f bug.awk
abcdef
However, I'll need several files to be accessed, so I am forced to use getline:
BEGIN {
    ORS = ""
    OFS = ""
    RS  = "[\x00-\xFF]"
    while (getline)     # read records explicitly instead of via the main loop
    {
        print RT
    }
}
This is seemingly equivalent to the above, but running it holds a nasty surprise:
$ echo "abcdef" | awk -f bug.awk
abc
This means that, for some reason, getline is encountering the EOF condition 3 bytes early. So, did I miss something I should know about the internals of bash/Linux buffering, or did I find a dreadful bug?
Just for the record: I am using GNU Awk 4.0.1 on Ubuntu 14.04 LTS (Linux 3.13.0/36)
Any tips, guys?
UPDATE: I am using getline because I have previously read and preprocessed the file(s) and stored the results in file(s) under /dev/shm/. Then I need to do a few final processing steps. The above examples are just bare-minimum scripts to show the problem.