C: Low level character formatting: (enter+newline) using fgetc

Question

I'm working on a project in C that reads a text file and converts it to an array of booleans. First I read the file into a string of size n (an unsigned char array), then I use a function to convert that string to a boolean array of size n * 8. The function works perfectly, no questions on that.

I get the string from the file using this code:

unsigned char *Data_in; // define pointer to string
int i;

FILE* sp = fopen("file.txt", "r"); //open file

fseek(sp, 0, SEEK_END);            // points sp to the end of file
int data_dim = ftell(sp);          // Returns the position of the pointer (amount of bytes from beginning to end)
rewind(sp);                        // points sp to the beginning of file

Data_in = (unsigned char *) malloc ( data_dim * sizeof(unsigned char) ); //allocate memory for string
unsigned char carac; //define auxiliary variable 

for(i=0; feof(sp) == 0; i++)       // while end of file is not reached (0)
{
   carac = fgetc(sp);              //read character from file to char
   Data_in[i] = carac;             // put char in its corresponding position
}
//

fclose(sp);                        //close file

The thing is that I have a text file made with Notepad in Windows XP. Inside it I have this 4-character string: ":\n\nC" (colon, enter key, enter key, capital C).

This is what it looks like with HxD (hex editor): 3A 0D 0A 0D 0A 43.

This table makes it clearer:

character             hex      decimal    binary
 :                    3A       58         0011 1010
 \n (enter+newline)   0D 0A    13 10      0000 1101 0000 1010    
 \n (enter+newline)   0D 0A    13 10      0000 1101 0000 1010
 C                    43       67         0100 0011

Now, I execute the program, which prints that part in binary, so I get:

character      hex      decimal      binary
 :             3A         58         0011 1010
 (newline)     0A         10         0000 1010    
 (newline)     0A         10         0000 1010
 C             43         67         0100 0011

Well, now that this is shown, I ask the questions:

Is the reading correct?
If so, why does it take the 0Ds out?
How does that work?

Your English is very understandable. It is rather charming too! — wallyk, May 29 '12 at 07:06

score 4 · Answer 1 · answered May 29 '12 at 07:01

Make the fopen binary:

fopen("file.txt", "rb");
                    ^

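Otherwise, on Windows, your standard library opens the file in text mode and translates each \r\n (0x0D 0x0A) pair into a single \n, so the \r (0x0D) never reaches your program.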

As a side note, opening the file in binary mode also avoids another problem: in text mode on DOS and Windows, a Ctrl-Z byte (0x1A) in the middle of the file is treated as EOF.
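To see the effect of the mode string yourself, here is a minimal sketch. The helper names `write_sample` and `count_bytes` are made up for illustration; they are not part of the asker's program. It writes the six bytes from the hex dump and counts how many bytes fgetc() hands back for a given mode.

```c
#include <stdio.h>

/* Hypothetical helper: writes the six bytes from the question's hex dump
 * (3A 0D 0A 0D 0A 43) to `path`. Binary mode is used so that the bytes
 * land on disk exactly as listed. Returns 0 on success, -1 on failure. */
int write_sample(const char *path)
{
    static const unsigned char bytes[] = { 0x3A, 0x0D, 0x0A, 0x0D, 0x0A, 0x43 };
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(bytes, 1, sizeof bytes, fp);
    if (fclose(fp) != 0 || written != sizeof bytes)
        return -1;
    return 0;
}

/* Hypothetical helper: counts the bytes fgetc() delivers when `path`
 * is opened with `mode` ("r" or "rb"). Returns -1 if the file cannot
 * be opened. */
long count_bytes(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        return -1;
    long n = 0;
    int c;                              /* int, so EOF stays distinguishable */
    while ((c = fgetc(fp)) != EOF)
        n++;
    fclose(fp);
    return n;
}
```

After `write_sample("sample.bin")`, calling `count_bytes("sample.bin", "rb")` gives 6 on any platform. With "r" the result depends on the platform: a Windows C library drops the two 0x0D bytes and reports 4, while POSIX systems make no distinction between the modes and report 6.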

cnicutar

That's interesting, now it works perfectly. Also, your side note seems to have answered another question about another problem I think I'm having, thanks! – Machine-Code Reader May 29 '12 at 07:15

score 1 · Answer 2 · answered May 29 '12 at 07:04

It is because you're opening the file in text mode. If you open it in binary mode, you will be able to see both characters. To do this, use "rb" as the mode when opening the file. You can also use fread to read the file contents.
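A minimal sketch of the "rb" plus fread approach, assuming the file fits in memory and that ftell() reports its size in bytes (which holds for binary streams on common platforms). The name `read_file` is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole of `path` into a malloc'd buffer.
 * On success returns the buffer and stores the byte count in *out_len;
 * on failure returns NULL. The caller must free() the result. */
unsigned char *read_file(const char *path, size_t *out_len)
{
    FILE *fp = fopen(path, "rb");       /* binary: no \r\n translation */
    if (fp == NULL)
        return NULL;

    if (fseek(fp, 0, SEEK_END) != 0) {
        fclose(fp);
        return NULL;
    }
    long size = ftell(fp);
    if (size < 0) {
        fclose(fp);
        return NULL;
    }
    rewind(fp);

    /* Allocate at least one byte so an empty file still yields a valid pointer. */
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf == NULL) {
        fclose(fp);
        return NULL;
    }

    /* fread returns the number of bytes actually read, so use that as the length. */
    size_t got = fread(buf, 1, (size_t)size, fp);
    fclose(fp);

    *out_len = got;
    return buf;
}
```

For the file in the question, `read_file("file.txt", &n)` should leave n at 6 with the 0x0D bytes still in the buffer.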

Superman

score 1 · Answer 3 · answered May 29 '12 at 07:16

In addition to the "rb" issue, there's one more error: you'll read an extra character at the end, because feof(sp) still returns 0 after reading the last character. It returns nonzero only after you have attempted to read past EOF. This is a common beginner's mistake. The idiomatic C code to iterate over input characters is

int c;   /* int, not char due to EOF. */

while ((c = fgetc(sp)) != EOF) {
   /* Work with c. */
}
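To make the off-by-one visible, here is a small sketch that counts iterations with both loop styles on an already opened stream. The function names are made up for illustration, and the feof() version assumes no read error occurs (with an error it would never terminate).

```c
#include <stdio.h>

/* Counts iterations of the feof()-controlled loop from the question.
 * The last fgetc() call returns EOF, but the iteration is still counted
 * because feof() only becomes nonzero after that failed read. */
long count_with_feof(FILE *fp)
{
    long n = 0;
    while (feof(fp) == 0) {
        (void)fgetc(fp);
        n++;
    }
    return n;
}

/* Counts iterations of the idiomatic loop: the read and the EOF test
 * happen together, so only real characters are counted. */
long count_with_idiom(FILE *fp)
{
    long n = 0;
    int c;   /* int, not char, due to EOF */
    while ((c = fgetc(fp)) != EOF)
        n++;
    return n;
}
```

On a 6-byte stream the first function returns 7 and the second returns 6. In the question's code that seventh iteration also stores one byte past the end of the malloc'd buffer.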

Jonathan Leffler · Answer 4 · 2012-05-29T07:27:20.877

The other answers have discussed binary vs text mode input.

Your code actually has a separate problem in it. This idiom is for Pascal, not C:

for (i = 0; feof(sp) == 0; i++)
{
   carac = fgetc(sp);
   Data_in[i] = carac;
}

The trouble is that when the fgetc() gets EOF, you treat it as a character (probably mapping it to ÿ, y-umlaut, U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS). The feof() test is misplaced; it does not detect EOF in advance of the attempt to read the next character. Additionally, the function fgetc() and its relatives getc() and getchar() all return an int, not a char. You must learn to use the standard C idiom:

int c;
for (i = 0; (c = fgetc(sp)) != EOF; i++)
   Data_in[i] = c;

The idiom is the combination of assignment and test. The counting around it is less standard; in fact, it is likely to be fairly uncommon. But it is not wrong; it is applicable to your program.

There's no need to use feof() in most C code; virtually any time you use it, it is a mistake. Not always; it exists for a purpose. But that purpose is to distinguish between EOF and an error after a function such as fgetc() has returned EOF, not to test whether you've reached the EOF yet before a reading function says it has reached EOF. (In all my hundreds of programs, I don't think there are more than a very few references to feof(): 2884 source files, 18 references to feof(), and most of those in code originally written by other people.)
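As a sketch of that legitimate use: once fgetc() has returned EOF, feof() and ferror() tell you why. The helper name `drain` is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: reads `fp` until fgetc() returns EOF.
 * Returns the number of characters read if the stream really ended,
 * or -1 if the EOF return was caused by a read error. */
long drain(FILE *fp)
{
    long n = 0;
    int c;

    while ((c = fgetc(fp)) != EOF)
        n++;

    /* Only now, after a read has failed, is feof() meaningful. */
    if (feof(fp))
        return n;       /* genuine end of file */

    return -1;          /* ferror(fp) is set: I/O error */
}
```

The point of the design is that feof() is consulted once, after the loop, and never as the loop condition.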

I don't know why you say it's Pascal code when I don't know a single bit of Pascal, and I'm using mingw32 to compile all ANSI C code, but you have a point with the feof() thing, it's really easier to use EOF. And isn't it the same to use a char in this case? I mean, fgetc returns an int, but it can be interpreted as a char; setting aside what we/the compiler call it, it's just 8 bits, right? Or could the fgetc function return a value greater than 255 and smaller than 2^32? Anyway, thanks for the answer, it was very informative! — Machine-Code Reader, May 29 '12 at 07:38
fgetc can't return a char, because in addition to the 256 possible char values, it needs to return a 257th: EOF, which is usually #defined as -1. So you need a type at least 9 bits wide. Using int was the choice of the language designers. — Jens, May 29 '12 at 08:07
@Machine-CodeReader The reason for suggesting Pascal is that it is an error in (standard) Pascal to attempt to read from a file that's reached EOF, so you have to test for EOF before trying the I/O (which is guaranteed not to fail on account of EOF). Jens has nicely summarized the reason for `fgetc()` returning `int`. It is one of the pitfalls that people fall into when learning C (more usually with `getchar()` than `fgetc()`, but the logic is the same). — Jonathan Leffler, May 29 '12 at 13:59
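A short sketch of the pitfall discussed in these comments. It assumes EOF is -1 and that converting 255 to signed char yields -1, which is the case on common compilers but is not guaranteed by older C standards. The function names are made up for illustration.

```c
#include <stdio.h>

/* Buggy on purpose: the result of fgetc() is squeezed into a signed char.
 * Under the assumptions above, a 0xFF byte in the file becomes -1 and
 * compares equal to EOF, so the loop stops early. */
long count_with_signed_char(FILE *fp)
{
    long n = 0;
    signed char c;
    while ((c = (signed char)fgetc(fp)) != EOF)
        n++;
    return n;
}

/* Correct: int holds every unsigned char value (0..255) and EOF as a
 * distinct 257th value, so a 0xFF byte is read as 255, not as EOF. */
long count_with_int(FILE *fp)
{
    long n = 0;
    int c;
    while ((c = fgetc(fp)) != EOF)
        n++;
    return n;
}
```

On a stream holding the three bytes 'A', 0xFF, 'B', the first function stops after 1 character and the second counts all 3. With an unsigned char variable the opposite problem appears: the stored value can never equal EOF, so the real end of file is read as 0xFF (ÿ) instead.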