I am processing a number of large text files, i.e. converting them all from one format to another. There are some small differences in the original formats of the files, but, with a bit of pre-processing in a few cases, they are mostly being converted successfully by a Bash shell script I have written.

So far so good, but one thing is puzzling me. At one point the script sets a variable called `$iterations`, so that it knows how many times to run a particular `for` loop. The value is the number of empty lines in a temporary file that the script creates.

Thus, the original version of my script contained the line:

    iterations=$(cat tempfile | grep '^$' | wc -l)

This has worked fine so far with all but one of the text files. For that file `$iterations` was not set correctly: it came out as `1`, even though there appeared to be more than 20,000 empty lines in `tempfile`.

However, having discovered `grep -c`, I changed the line to:

    iterations=$(cat tempfile | grep -c '^$')

and the script suddenly worked, i.e. `$iterations` was set correctly.

Can anyone explain why the two versions produce different results, and why the first version would work on some files and not others? Is there some upper limit above which `wc -l` defaults to 1? The file that fails with the first version is one of the largest, but not the largest in the set (which converted correctly the first time).

anatolyg
John W
  • Can you replicate this? That is, do you have a file for which `grep -c '^$'` produces output different than `grep '^$' | wc -l`? – William Pursell Apr 18 '17 at 16:46
  • I wonder if the file contains something funny that confuses `wc`, would `cat tempfile | grep '^$' | hexdump -C | head` produce anything interesting? – Dima Chubarov Apr 18 '17 at 16:46
  • `printf 'foo\nbar\n\x00\n\n\n\n' | { cat > /tmp/file; grep -c '^$' < /tmp/file; grep '^$' < /tmp/file | wc -l; }` Dmitri's got it. With a null character, `wc` produces `1`, while `grep -c` counts 4. – William Pursell Apr 18 '17 at 16:49
  • Of course, the problem is that `grep` is printing `Binary file (standard input) matches`, and `wc` is counting that line! – William Pursell Apr 18 '17 at 16:53
  • Another reason could be that grep 2.13 wrongly treats some files as binary, e.g. large files stored on filesystems that implement deduplication. This was corrected in 2.14 [(git log)](http://git.savannah.gnu.org/cgit/grep.git/commit/?h=v2.14&id=aa99fbeb9ff85ce8ff024c5ace4e0690726e0ca7) and later versions. – Dima Chubarov Apr 19 '17 at 01:55

1 Answer

If `grep` decides that the input is not a text file, it prints the single line `Binary file (standard input) matches` instead of the matching lines, and `wc -l` counts that one line! `grep -c`, on the other hand, does the counting itself and prints only the number of matches in the file.
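
A small reproduction, along the lines of the one-liner in the comments, may make this concrete. It is only a sketch and assumes GNU grep. The scratch file and the variable names (`sample`, `piped`, `counted`, `forced`) are made up for illustration. The first two counts depend on the grep version (newer releases reportedly send the binary-file notice to standard error rather than standard output), so no expected values are claimed for them.

```shell
#!/bin/sh
# Build a small sample: two text lines, one line holding only a NUL byte,
# then three empty lines. "sample" is a hypothetical scratch file.
sample=$(mktemp)
printf 'foo\nbar\n\000\n\n\n\n' > "$sample"

# Pipeline from the question: wc -l counts whatever grep writes to stdout,
# which may be a one-line "Binary file ... matches" notice.
piped=$(grep '^$' < "$sample" | wc -l)

# grep -c: grep counts the matches itself, so no notice can be counted.
counted=$(grep -c '^$' < "$sample")

# -a (--text) asks grep to treat the input as text, so the matching
# lines themselves reach wc -l.
forced=$(grep -a '^$' < "$sample" | wc -l)

echo "grep | wc -l:    $piped"
echo "grep -c:         $counted"
echo "grep -a | wc -l: $forced"

rm -f "$sample"
```

With `-a` the pipeline should report the three genuinely empty lines, since the line holding the NUL byte is not empty. If the binary detection is unwanted, `grep -c` or `grep -a` both avoid counting the notice.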

William Pursell
  • @dmitri: I see (I think)... somewhere in that large text file, there must be a fortuitous character sequence which `grep` (without `-c`) interprets as a null character? I'd never have thought of that. I've never come across the null character; I guess it must have its uses. :-) – John W Apr 18 '17 at 20:35
  • Not necessarily a nul. Could be any character that causes grep to treat the file as a binary file. – William Pursell Apr 18 '17 at 20:54
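
To check whether a given file contains NUL bytes at all, one rough approach is to throw away every other byte and count what remains. This is a sketch only: `suspect` is a hypothetical stand-in for the real `tempfile`, and it tests for NUL bytes alone. As far as I know, GNU grep can also treat input as binary when it contains bytes that are not validly encoded for the current locale, and this check would not catch that case.

```shell
#!/bin/sh
# "suspect" is a hypothetical stand-in for the real tempfile; here it is
# a scratch file with one NUL byte embedded in the third line.
suspect=$(mktemp)
printf 'first\n\nsec\000ond\n\n' > "$suspect"

# tr -cd '\000' deletes every byte except NUL; wc -c counts what is left.
nul_bytes=$(tr -cd '\000' < "$suspect" | wc -c)

echo "NUL bytes found: $nul_bytes"

rm -f "$suspect"
```

A non-zero count suggests the file is a candidate for the binary-file behaviour described above.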