Representation of a C binary file

Question

For a homework assignment I created a simple compression/decompression program that uses a naive implementation of run-length encoding. The program works: compressing and decompressing any text file with a fairly large number of characters (e.g. the program source) round-trips flawlessly. As an experiment I tried to compress and decompress the binary of the compression program itself. The result was a file much smaller than the original binary, and it obviously cannot be run. What is causing this data loss?

My assumption was that it's related to how binary files are represented, but I can't figure much out past that.

@NPE No, my program just reads in from the input using `getchar()`. Isn't this just grabbing bits from the file 8 at a time and returning the integer value of them? — grimetime, Apr 01 '13 at 07:01
I'm not going to be able to do that right away, the program assignment due date is still a few days from now. — grimetime, Apr 01 '13 at 07:03
@grimetime: if you don't open the file/stream as binary, then on some platforms reading the file will transform line-endings to map them to a `'\n'` character (even for `getchar()`). Also, some platforms will treat a particular control character as an EOF (Windows does this when it encounters a Ctrl-Z if the file is opened in text mode). However, on Linux you will not run into these problems, but you should still open the files in binary mode in case the program is ever built for Windows. — Michael Burr, Apr 01 '13 at 07:06
"I'm not going to be able to do that right away" -- Then why even bother to post your question? Without the code, all anyone can do is guess what the cause is. — Jim Balter, Apr 01 '13 at 07:08
Read http://en.wikipedia.org/wiki/Executable_and_Linkable_Format to understand what a binary executable is on Linux — Basile Starynkevitch, Apr 01 '13 at 07:31

score 3 · Accepted Answer · answered Apr 01 '13 at 07:03

Possible issues:

Your program opens the binary file in text mode, which on some platforms alters the '\r' and '\n' bytes
Your program incorrectly handles zero bytes, treating them as string terminators ('\0') rather than as data in their own right
Your program uses char (which is signed on many platforms) for the bytes of data and works correctly only with non-negative values, which the ASCII characters of English text are, but fails on arbitrary byte values, which may be negative
Your program has an overflow somewhere which shows up only on big files
Your program has some other data-dependent bug
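
As a rough illustration of the zero-byte and signed-char points above, here is a minimal sketch of a run-length encoder that treats its input as raw bytes with an explicit length. The function name `rle_encode` and the (count, value) output format are assumptions made for this example; they are not taken from the asker's code, which was never posted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: run-length encode `len` bytes from `in` into `out`
 * as (count, value) pairs and return the number of bytes written.
 * `out` must have room for 2 * len bytes (worst case: no repeated bytes). */
size_t rle_encode(const unsigned char *in, size_t len, unsigned char *out)
{
    assert(in != NULL && out != NULL);

    size_t written = 0;
    size_t i = 0;

    /* Iterate by length, never by looking for a '\0' terminator,
     * so zero bytes are encoded like any other value. */
    while (i < len) {
        unsigned char value = in[i];  /* unsigned: 0x80..0xFF stay positive */
        size_t run = 1;

        /* Cap each run at 255 so the count always fits in one byte. */
        while (i + run < len && in[i + run] == value && run < 255)
            run++;

        out[written++] = (unsigned char)run;
        out[written++] = value;
        i += run;
    }
    return written;
}
```

Because the loop is driven by `len` and the data is held in `unsigned char`, a buffer full of 0x00 and 0xFF bytes is handled the same way as ordinary text.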


Alexey Frunze


On Linux binary and text modes are the same.... – Basile Starynkevitch Apr 01 '13 at 08:11
@BasileStarynkevitch That's not set in stone. We haven't been told what compiler is used. – Alexey Frunze Apr 01 '13 at 08:20
The handling of text mode vs binary mode is not compiler dependent. It is done by the standard libraries (`libc` or `libstdc++`); and on Linux they all handle binary & text identically w.r.t. EOL. – Basile Starynkevitch Apr 01 '13 at 08:28
@BasileStarynkevitch I can have a different library with a different behavior and that will be OK with the language standard. – Alexey Frunze Apr 01 '13 at 08:35
But that won't be OK with the Linux Standard Base. – Basile Starynkevitch Apr 01 '13 at 08:38
@BasileStarynkevitch Yep. – Alexey Frunze Apr 01 '13 at 08:44

score 1 · Answer 2 · answered Apr 01 '13 at 07:01

If the platform is Linux (as the question is tagged), there's no difference between binary and text modes. So it shouldn't be that; but even so, the files should be opened as binary.
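
A minimal sketch of a byte-safe read loop, assuming the stream has already been opened in binary mode (for a named file, `fopen(path, "rb")`). The helper name `count_bytes` is hypothetical and exists only for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: count every byte remaining in an open stream. */
long count_bytes(FILE *fp)
{
    assert(fp != NULL);

    long n = 0;
    int c;  /* int, not char: fgetc() returns an unsigned char value
             * converted to int, or EOF. Storing the result in a char
             * can make a 0xFF data byte indistinguishable from EOF. */

    while ((c = fgetc(fp)) != EOF)
        n++;

    return n;
}
```

The same rule applies to `getchar()`: keep its return value in an `int` until it has been compared with `EOF`.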

I suspect that your problem is that the program treats '\0' characters as terminators (or otherwise specially) instead of as valid data.
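
To see how that kind of bug loses data, here is a small sketch. The helper name `bytes_seen_by_strlen` is made up for illustration; whether the asker's program does anything like this is only a guess.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: how many bytes a string-based routine would process.
 * `buf` must contain at least one '\0', or strlen() reads out of bounds. */
size_t bytes_seen_by_strlen(const char *buf)
{
    assert(buf != NULL);
    return strlen(buf);  /* stops at the first zero byte */
}
```

For a six-byte buffer such as `{'a', 'b', '\0', 'c', 'd', '\0'}`, this returns 2, so everything after the first zero byte would be silently dropped. Executables contain many zero bytes, which would be consistent with an output far smaller than the input.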
