
Here's the deal: I need to read a specific number of bytes, which will be processed later on. I've encountered a strange phenomenon, though, and I couldn't wrap my head around it. Maybe someone else can? :)

NOTE: The following code examples are slimmed-down versions, just enough to show the effect!

One way of doing this, at least with gawk, is to set RS to a catch-all regex and then use RT to see what has been matched:

RS="[\x00-\xFF]"

Then, quite simply, use the following awk script:

BEGIN {
  ORS=""                # don't append anything after each print
  OFS=""
  RS="[\x00-\xFF]"      # any single byte matches, so each record is one byte
}
{
  print RT              # RT holds the text that actually matched RS
}

This works fine:

$ echo "abcdef" | awk -f bug.awk
abcdef

However, I'll need to access several files, so I am forced to use getline:

BEGIN {
  ORS=""
  OFS=""
  RS="[\x00-\xFF]"      # any single byte matches, so each record is one byte

  while (getline)       # read the records explicitly instead of via the main loop
  {
    print RT            # RT holds the text that actually matched RS
  }
}

This is seemingly equivalent to the above, but running it brings a nasty surprise:

$ echo "abcdef" | awk -f bug.awk
abc

This means that, for some reason, getline is encountering the EOF condition 3 bytes early. So, did I miss something I should know about the internals of bash/Linux buffering, or did I find a dreadful bug?

Just for the record: I am using GNU Awk 4.0.1 on Ubuntu 14.04 LTS (Linux 3.13.0/36)

Any tips, guys?

UPDATE: I am using getline because I have previously read and preprocessed the file(s) and stored them in file(s) under /dev/shm/. Then I need to do a few final processing steps. The above examples are just bare-minimum scripts to show the problem.
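
For context, here is a hypothetical sketch of that pipeline; the port, file names, and script names are made up, and `nc` option syntax varies between implementations:

$ nc -l 9999 > /dev/shm/raw.dat                # capture the network data
$ awk -f preprocess.awk /dev/shm/raw.dat > /dev/shm/prep.dat
$ awk -f final.awk /dev/shm/prep.dat           # the getline loop shown above lives here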

Dan
  • I don't have a way to test this right now, but I would expect `awk` to honor a list of files from the command line without need for getline. Did you try your original code with multiple files? Plus-uno for a well-researched problem. But yes, your test seems like it should work, and it's not obvious why you're losing data. Oh.. any chance of `\r\n` line endings in your data? If so, `dos2unix file` or `... | tr -d '\015' | ...`. Keep posting and good luck. – shellter Jan 21 '16 at 17:54
  • Well, what I am doing is reading in from `stdin`, putting it in a temporary file in `/dev/shm`, and then reading that file back and doing some processing. (The data should come from the network via `nc`.) And thanks a lot, I'm trying my best... ;) – Dan Jan 21 '16 at 17:56
  • Hm.. you said "However, I'll need several files, to be accessed, ..". For other readers, you may want to clarify that your `echo abcdef | awk ..` reflects how you really want to deal with this problem. Good luck! – shellter Jan 21 '16 at 18:04
  • For fun maybe try putting a `print ""` after the `while` loop --- in case there's some lingering output that's not getting flushed for some reason. – jas Jan 21 '16 at 18:40
  • `I'll need several files, to be accessed, so I am forced to use getline` - that statement is completely unclear, since awk doesn't require getline when reading multiple files (see the sketch after these comments), and it gets even less clear once you tell us you're reading from `stdin`, not from any files. Please edit your question to clearly state where awk is getting its input from and, if you think you need to use `getline`, then clearly state why that is the case. You might also want to read http://awk.info/?tip/getline. – Ed Morton Jan 21 '16 at 20:13
  • There are several tools which will read a specified number of bytes. `head` is probably the simplest but `dd` is more flexible. `awk` is probably not the simplest choice. – rici Jan 22 '16 at 00:14
  • Make that http://awk.freeshell.org/AllAboutGetline instead of awk.info – Ed Morton Aug 04 '17 at 17:46
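
As the comments note, awk handles multiple input files natively: every file named on the command line is fed through the same main loop, no getline required. A minimal sketch with hypothetical file names:

$ awk -f bug.awk part1.bin part2.bin part3.bin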

2 Answers

1

Seems like this is a manifestation of the bug reported here, which (if I understand it correctly) has the effect of terminating the getline prematurely when close to the end of input, rather than at the end of input.

The bug fixes seem to have been committed on May 9 and May 10, 2014, so if you can upgrade to version 4.1 it should fix the problem.
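
Independent of the upgrade, it may be worth guarding the return value of getline explicitly, since (as noted in the comments below) it returns 1 on success, 0 at EOF, and -1 on error, and -1 is truthy. A minimal sketch of that defensive idiom:

BEGIN {
  ORS = ""
  RS = "[\x00-\xFF]"
  # Compare against 0 so a read error (-1, which is truthy)
  # cannot turn the loop into an infinite one.
  while ((getline) > 0)
    print RT
}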


If all you need to do is read a specified number of bytes, I'd suggest that awk is not the ideal tool, regardless of bugs. Instead, you might consider one of the following two standard utilities, which will be able to do the work rather more efficiently:

head -c $count

or

dd bs=$count count=1

With dd you can explicitly set the input file (if=PATH) and output file (of=PATH) if stdin/stdout are not appropriate. With head you can specify the input file as a positional parameter, but the output always goes to stdout.
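
For instance, to take just the first 3 bytes of the sample input (illustrative commands; `2>/dev/null` merely hides dd's transfer statistics):

$ printf 'abcdef' | head -c 3
abc
$ printf 'abcdef' | dd bs=3 count=1 2>/dev/null
abc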

See man head and man dd for more details.

rici
  • Well, then I did hit a **bug**. I'm not asking how you've known about this. :) You're right, `awk` is probably not the right tool for this. But the thing is, I have to process line-based data with occasional blocks of octets in there, so it was rather tempting to try and get away with this. But I didn't. :) – Dan Jan 22 '16 at 10:22
  • @Dan: I didn't know, but I verified that it happens on v4.0 and not v4.1, since I have both available. Then I searched the buglist for getline, which made it pretty easy to find. Good luck with the project. – rici Jan 22 '16 at 14:08
0

Fortunately, using GNU Awk 4.1.3 (on a Mac), your program with getline works as expected:

echo "abcdef" | gawk 'BEGIN{ORS="";OFS="";RS="[\x00-\xFF]";
  while (getline) {print RT}}'
abcdef
$ gawk --version
GNU Awk 4.1.3, API: 1.1
peak
  • That will spin off into an infinite loop if the `getline` fails. See http://awk.info/?tip/getline. – Ed Morton Jan 21 '16 at 20:24
  • Just for the record, according to the [official doc](https://www.gnu.org/software/gawk/manual/html_node/Getline.html) it shouldn't: `[...] The getline command returns 1 if it finds a record and 0 if it encounters the end of the file. [...]` – Dan Jan 22 '16 at 00:54
  • @Dan: you left out the next sentence: "If there is some error... getline returns -1". -1, like 1, is true. – rici Jan 22 '16 at 13:50