
I was running cmp -l file.bin file2.bin but started to get cmp: EOF on file, and I suspected a Windows/Linux line-ending problem as described here. I should be splitting binary data by some sequence, so I did some profiling. I noticed that the problem affects only some of my files, which are 1 GB in size.

Output from od -c file.bin

0435500  \n   <A5>  \n   Y  \n   f  \n   p  \n   <A9>  \n   A  \n   W  \n 202
0435520  \n   <B0>  \n   M  \n   t  \n 202  \n   <B1>  \n   i  \n   i  \n 227
0435540  \n 221  \n   Y  \n   ;  \n   <B2>  \n 225  \n   <  \n   J  \n 217
0435560  \n   <A9>  \n   <  \n 211  \n   <AB>  \n 201  \n   T  \n   y  \n 204
0435600  \n 212  \n   \  \n   v  \n   p  \n   |  \n   9  \n   M  \n   u
0435620  \n 214  \n   <  \n   r  \n   <A0>  \n   <AF>  \n   X  \n   W  \n 204
0435640  \n   <A5>  \n   B  \n   a  \n 207  \n   <AA>  \n   S  \n   ^  \n   |
0435660 004  \r  \n   > 003   <ED> 003   <E8>  \f   . 003   <EC>  \f   * 004 032
0435700  \f   h  \f   m  \f   i  \f   h  \n   o 004 024  \n   k  \n   <A5>
0435720  \n   <A2>  \n   =  \n   k  \n   p  \n   <B1>  \n   I  \n   ^  \n   y
0435740  \n 227  \n   <  \n   T  \n   |  \n 224  \n   8  \n   w  \n 202

Here \r \n appears once, in the line at offset 0435660. In total there are 11 matches on 11 lines, out of a total of 0571520 lines of 60 characters. So Windows line endings seem to make up about 0.001% of the file content, which is far less than in normal cases. Only a minority of the files have this problem, and the original data sources do not. This suggests to me that the problem lies in data processing. Is this enough confirmation that those sequences are Windows line endings?
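A count like this can be reproduced without loading a 1 GB file into memory. The following is a minimal Python sketch; the function name count_crlf and the chunk size are assumptions chosen for illustration. It counts only the byte pair 0x0D 0x0A. In binary data that pair can also occur by chance, so a count alone does not prove that the bytes are line endings.

```python
import io


def count_crlf(stream, chunk_size=1 << 20):
    """Count occurrences of the byte pair CR LF (0x0D 0x0A) in a binary stream."""
    count = 0
    carry = b""  # last byte of the previous chunk, so a pair split across chunks is seen
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = carry + chunk
        count += data.count(b"\r\n")
        carry = data[-1:]
    return count


# Demo on in-memory data; for a real file, pass open("file.bin", "rb") instead.
sample = io.BytesIO(b"\n\xa5\n\x04\r\n>\x03\r\n")
print(count_crlf(sample, chunk_size=4))
```

Carrying over one byte between chunks is enough because the pattern is two bytes long and always starts with \r.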

My files contain events, each of which should have a fixed length. So I am not sure how well dos2unix will work here, since I cannot change the length of an event. I think I need to either remove the events that contain Windows line endings or replace the Windows line ending \r\n with \0\n. However, I am not sure whether I can do this by inserting the literal string into the content without changing the length of some events. If the length of any event changes, the system stops working.
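Both ideas can be sketched in Python as follows. This is only an illustration under stated assumptions: event_size is a hypothetical parameter (the real event length is not given here), and whether \0\n is acceptable to the downstream system is untested. Because \r\n and \0\n are both two bytes long, the replacement leaves every offset unchanged.

```python
def patch_crlf(data: bytes) -> bytes:
    """Replace each CR LF pair with NUL LF; both are two bytes, so offsets are unchanged."""
    patched = data.replace(b"\r\n", b"\x00\n")
    assert len(patched) == len(data)
    return patched


def drop_events_with_crlf(data: bytes, event_size: int) -> bytes:
    """Remove whole fixed-length events that contain CR LF (event_size is hypothetical)."""
    events = (data[i:i + event_size] for i in range(0, len(data), event_size))
    return b"".join(e for e in events if b"\r\n" not in e)


raw = b"ab\r\ncdefgh\r\n"
print(patch_crlf(raw))
print(drop_events_with_crlf(raw, event_size=4))
```

Dropping events shortens the file but keeps each remaining event at its original length. A CR LF pair that straddles the boundary between two events is not detected by drop_events_with_crlf.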

How to work with Windows/Unix EOF warnings in binary data?

Léo Léopold Hertz 준영

3 Answers


The likely reason for cmp: EOF on file is that the files are of different lengths.

-l, --verbose
Output the (decimal) byte numbers and (octal) values of all differing bytes, instead of the default standard output. Also, output the EOF message if one file is shorter than the other. ref
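A quick way to test this explanation is to compare the two file sizes directly. The sketch below is a hedged illustration in Python; size_difference is a hypothetical helper, and the temporary files merely stand in for file.bin and file2.bin.

```python
import os
import tempfile


def size_difference(path_a, path_b):
    """Return the size of path_a minus the size of path_b, in bytes."""
    return os.path.getsize(path_a) - os.path.getsize(path_b)


# Demo with two temporary files standing in for file.bin and file2.bin.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "file.bin")
    b = os.path.join(tmp, "file2.bin")
    with open(a, "wb") as f:
        f.write(b"\n\xa5\nY")
    with open(b, "wb") as f:
        f.write(b"\n\xa5\nY\nf")
    print(size_difference(a, b))
```

A nonzero result means one file is shorter, which is the situation in which the EOF message can appear.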

chux - Reinstate Monica
  • The last line of the wiki page is missing from the GNU and BSD manuals in `man`. I opened a new thread about getting better manuals in OSX here http://apple.stackexchange.com/q/194469/15504 but the same problem also exists in my Redhat Linux variant. – Léo Léopold Hertz 준영 Jul 04 '15 at 21:25
  • @Masi: The man page doesn't include this information, but it points to the version of the manual that does (for systems that use the GNU diffutils version of `cmp`): "The full documentation for `cmp` is maintained as a Texinfo manual. If the `info` and `cmp` programs are properly installed at your site, the command `info cmp` should give you access to the complete manual." – Keith Thompson Jul 04 '15 at 21:35
  • @Masi [@Keith Thompson](http://stackoverflow.com/a/31225273/2410359) is the better answer - suggest changing the accepted answer. – chux - Reinstate Monica Jul 04 '15 at 21:36
  • @chux In `info cmp`, I also see the `-l` flag with no mention of EOF, and the same holds in the Redhat variant. Reading further in info, the only entry is `-l`, `--verbose`: "Print the (decimal) byte numbers and (octal) values of all differing bytes." – Léo Léopold Hertz 준영 Jul 04 '15 at 21:38
  • @chux: Older versions of the GNU `info` manual for `cmp` also omit the EOF message. See my updated answer for details. – Keith Thompson Jul 04 '15 at 22:18

The cmp command prints a message:

cmp: EOF on SHORTER-FILE

if one file is a prefix of the other, i.e., if one file is shorter than the other and the shorter file is identical to the beginning of the longer file.

If the two files are of different lengths but the shorter file is not a prefix of the longer one, cmp will report the first byte offset at which they differ, without an EOF warning.
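The two cases described above can be modeled in a few lines of Python. This is a simplified sketch of the reporting logic, not GNU cmp's actual implementation: cmp_like and its return values are hypothetical, and the per-byte listing produced by -l is not modeled.

```python
def cmp_like(a: bytes, b: bytes):
    """Simplified model of default cmp reporting; not GNU cmp's implementation."""
    for offset, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return ("differ", offset)  # 1-based byte offset, as cmp numbers bytes from one
    # No difference in the common part: a shorter input is a prefix of the longer one.
    if len(a) < len(b):
        return ("eof", "first")
    if len(b) < len(a):
        return ("eof", "second")
    return ("same", None)


print(cmp_like(b"abc", b"abcdef"))  # first input is a prefix of the second
print(cmp_like(b"abX", b"abcdef"))  # differs before the shorter input ends
print(cmp_like(b"abc", b"abc"))     # identical inputs
```

In this model the EOF result is reached only when every byte of the shorter input matches, which corresponds to the prefix condition in the answer.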

On my system, the cmp(1) man page doesn't mention this, but it refers to the full documentation, which does.

If the GNU diffutils info documentation is not installed, or is configured incorrectly, the info command falls back to showing the man page.

On CentOS 5.11 (essentially identical to Red Hat), info diff shows the diffutils documentation; navigating to "Invoking cmp" shows the documentation for the cmp command. But it's an older version of the documentation, which is missing the information about the EOF message. (The diffutils 2.8.1 manual doesn't mention the EOF message; the diffutils 3.3 manual does.) Examining the history in the git repo, the wording was added in 2002 and first included in release 2.8.2. To see which version of GNU cmp you're running, type cmp --version. (The behavior was there all along; the documentation was updated to reflect it.)

The OSX cmp(1) man page is also the GNU diffutils version; it refers to the info documentation, but it also appears to be for version 2.8.1, which doesn't mention the EOF message.

Documentation for the current GNU diffutils version: http://www.gnu.org/software/diffutils/manual/html_node/Invoking-cmp.html

POSIX requires the same behavior: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/cmp.html

Keith Thompson
  • Yes, the full GNU and BSD documentation does not mention EOF. At least not in those two manuals, nor in my local GNU/BSD manuals. It appears only in that wiki page, whose source is some Physics department. The POSIX manual mentions EOF implicitly, but not as clearly as the wiki. – Léo Léopold Hertz 준영 Jul 04 '15 at 21:45
  • @Masi: As I said, the *man page* doesn't, but the full GNU documentation does. I linked to it in my answer; you should also be able to see it on your system by typing `info cmp`. – Keith Thompson Jul 04 '15 at 21:48
  • Yes, I pasted what I see into a comment on chux's answer. The options section of my `info cmp` is incomplete: there is no mention of EOF in the BSD or Redhat Linux variants. Actually, you should read `info diff` rather than `info cmp`, since the latter is incomplete. – Léo Léopold Hertz 준영 Jul 04 '15 at 21:49
  • @Masi: If the `info` documentation isn't installed, the `info` command falls back to the man page. On a CentOS 5.11 system, `info cmp` shows the man page; `info diff` shows the `diffutils` documentation, which includes the `cmp` documentation. But it's an older version with incomplete information on `cmp`. I've updated my answer. – Keith Thompson Jul 04 '15 at 22:13

Read the manuals with

info diff

and browse to the cmp sections. However, the manuals are still incomplete in GNU and BSD on OSX 10.10.3 and in the Redhat Linux variant.

Léo Léopold Hertz 준영