Reading all content from a text file - C

Question

I am trying to read all content from a text file. Here is the code which I wrote.

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 1024

static char *readcontent(const char *filename)
{
    char *fcontent = NULL, c;
    int index = 0, pagenum = 1;
    FILE *fp;
    fp = fopen(filename, "r");

    if(fp) {
        while((c = getc(fp)) != EOF) {
            if(!fcontent || index == PAGE_SIZE) {
                fcontent = (char*) realloc(fcontent, PAGE_SIZE * pagenum + 1);
                ++pagenum;
            }
            fcontent[index++] = c;
        }
        fcontent[index] = '\0';
        fclose(fp);
    }
    return fcontent;
}

static void freecontent(char *content)
{
    if(content) {
        free(content);
        content = NULL;
    }
}

This is the usage

int main(int argc, char **argv)
{
    char *content;
    content = readcontent("filename.txt");
    printf("File content : %s\n", content);
    fflush(stdout);
    freecontent(content);
    return 0;
}

Since I am new to C, I wonder whether this code looks perfect? Do you see any problems/improvements?

Compiler used : GCC. But this code is expected to be cross platform.

Any help would be appreciated.

Edit

Here is the updated code with fread and ftell.

static char *readcontent(const char *filename)
{
    char *fcontent = NULL;
    int fsize = 0;
    FILE *fp;

    fp = fopen(filename, "r");
    if(fp) {
        fseek(fp, 0, SEEK_END);
        fsize = ftell(fp);
        rewind(fp);

        fcontent = (char*) malloc(sizeof(char) * fsize);
        fread(fcontent, 1, fsize, fp);

        fclose(fp);
    }
    return fcontent;
}

I am wondering what will be the relative complexity of this function?

i hardly think using that as a filename will get you much kudos — Mitch Wheat, Aug 01 '10 at 06:57
ahh.. Sorry for that. I was testing and forgot to remove it. Extremely sorry. — Navaneeth K N, Aug 01 '10 at 07:03
I think in general you should try to work in fixed chunks; so in this case you would be reading in PAGE_SIZE bytes at a time (or less if it's the last chunk) and printing each chunk as you read them in. — wj32, Aug 01 '10 at 07:11
Do you know what your code does if you don't have read permission for "filename.txt"? Is that what you want it to do? — msw, Aug 01 '10 at 07:24
@msw : in my application I have checked all the return code and emiting proper messages. Omitted in the example for clarity. — Navaneeth K N, Aug 01 '10 at 07:30

Martin Ingvar Kofoed Jensen · Accepted Answer · 2010-08-01T07:26:50.690

12

You should try look into the functions fsize (About fsize, see update below) and fread. This could be a huge performance improvement.

Use fsize to get the size of the file you are reading. Use this size to do one alloc of memory only. (About fsize, see update below. The idea of getting the size of the file and doing one alloc is still the same).

Use fread to do block reading of the file. This is much faster than single charecter reading of the file.

Something like this:

long size = fsize(fp);
fcontent = malloc(size);
fread(fcontent, 1, size, fp);

Update

Not sure that fsize is cross platform but you can use this method to get the size of the file:

fseek(fp, 0, SEEK_END); 
size = ftell(fp);
fseek(fp, 0, SEEK_SET);

edited Aug 01 '10 at 07:26

answered Aug 01 '10 at 06:58

Martin Ingvar Kofoed Jensen

4,241
2
29
31

Thanks. I looked for documentation of `fsize`, but couldn't find one. Is this a platform independent function? How `fsize` can tell the file size without reading the whole file? – Navaneeth K N Aug 01 '10 at 07:07
`fsize` looks like it's Windows-specific. `stat(2)` is the UNIX equivalent. – Wang Aug 01 '10 at 07:14
Don't use `stat` for this purpose. If the "file" is not a normal file but something else (perhaps a hard disk partition) you will not get the size. Always use the seek-to-end method for determining the size. If you intend to support reading from non-seekable sources (like a pipe or socket) then you should probably also support the incremental-realloc approach if `ftell` returns -1. – R.. GitHub STOP HELPING ICE Aug 01 '10 at 07:50
@R. sure one should use `stat` in this case. The question explicitly states that this is about a text file. – Jens Gustedt Aug 01 '10 at 12:10

score 2 · Answer 2 · answered Aug 01 '10 at 07:04

People often realloc to twice the existing size to get amortized constant time instead of linear. This makes the buffer no more than twice as large, which is usually okay, and you have the option of reallocating back down to the correct size after you're done.

But even better is to stat(2) for the file size and allocate once (with some extra room if the file size is volatile).

Also, why you don't either fgets(3) instead of reading character by character, or, even better, mmap(2) the entire thing (or the relevant chunk if it's too large for memory).

score 2 · Answer 3 · answered Aug 01 '10 at 07:21

2

It is probably slower and certainly more complex than:

while((c = getc(fp)) != EOF) {
    putchar(c);
}

which does the same thing as your code.

answered Aug 01 '10 at 07:21

msw

42,753
9
87
112

score 1 · Answer 4 · answered Aug 01 '10 at 07:39

1

This is from a quick reading, so I might have missed a few issues.

First, a = realloc(a, ...); is wrong. If realloc() fails, it returns NULL, but doesn't free the original memory. Since you reassign to a, the original memory is lost (i.e., it is a memory leak). The right way to do this is to do: tmp = realloc(a, ...); if (tmp) a = tmp; etc.

Second, about determining the file size using fseek(fp, 0, SEEK_END);, note that this may or may not work. If the file is not random-access (such as stdin), you won't be able to go back to the beginning to read it. Also, fseek() followed by ftell() may not give a meaningful result for binary files. And for text files, it may not give you the right number of characters that can be read. There is some useful information on this topic on comp.lang.c FAQ question 19.2.

Also, in your original code, you don't set index to 0 when it equals PAGESIZE, so if your file length is greater than 2*PAGESIZE, you will overwrite the buffer.

Your freecontent() function:

static void freecontent(char *content)
{
    if(content) {
        free(content);
        content = NULL;
    }
}

is useless. It only sets a copy of content to NULL. It is just like if you wrote a function setzero like this:

void setzero(int i) { i = 0; }

A much better idea is to keep track of memory yourself and not free anything more or less than needed.

You shouldn't cast the return value of malloc() or realloc() in C, since a void * is implicitly converted to any other object pointer type in C.

Hope that helps.

answered Aug 01 '10 at 07:39

Alok Singhal

93,253
21
125
158

`stdin` is seekable if it refers to a seekable file. It's not seekable if it's an interactive device, pipe, etc. `fseek`/`ftell` **is** reliable on binary files on any reasonable system. Yes the C standard grandfathers-in legacy implementations where binary files can have random trailing zero bytes, but this is 2010 and all real present-day systems have real binary files. Text mode simply should not be used due to unpredictable and buggy behavior. Just strip the `\r`'s yourself. – R.. GitHub STOP HELPING ICE Aug 01 '10 at 08:04
@R..: On my Mac, `fseek(stdin, 0, SEEK_END)` succeeds, `ftell()` returns 0, and then I am able to read as many characters from `stdin` as I want. On linux, `fseek(stdin, 0, SEEK_END);` results in `Illegal seek` (the same program). I would prefer a `realloc()` based approach because then I won't have to deal with things like stripping `\r` myself, and it works for non-seekable files too. – Alok Singhal Aug 01 '10 at 08:21
Unless there's a reason you need the whole file in memory, you should probably follow msw's answer, which has no failure cases and easily provable correctness. BTW if you want to strip `\r` (for example from Windows text files) you'll have to do it yourself anyway. Only Windows and legacy Mac (pre-OSX) have "text mode" file operations which mangle the data. POSIX requires text mode to behave identically to binary mode, and it does on OSX, Linux, etc. – R.. GitHub STOP HELPING ICE Aug 01 '10 at 08:53
@Alok: Thanks. You have a very valid point here. I understand using ftell() and fseek() to find the file size is not the correct way. https://www.securecoding.cert.org/confluence/display/seccode/FIO19-C.+Do+not+use+fseek%28%29+and+ftell%28%29+to+compute+the+size+of+a+file explains that. So are you saying I should use the code which I have first with the changes suggested by you? – Navaneeth K N Aug 01 '10 at 14:34
@R.. of course, if the whole aim is to print the file back, one doesn't need complicated code. `while ((c = getchar()) != EOF)` or `while ((nread = fread(buf, 1, sizeof buf, fp) > 0)` both are easier and simpler :-). Interesting info about the requirement in POSIX. I didn't know that - thanks! – Alok Singhal Aug 01 '10 at 15:25
@Appu: Yes, my suggestion is to do what you did in your first approach, but with my changes, and with using `fread()` instead of reading character-by-character. Also make sure the file being read isn't huge :-). – Alok Singhal Aug 01 '10 at 15:26

score 1 · Answer 5 · edited Aug 12 '11 at 14:05

1

One problem I can see here is variable index which is non-decreasing. So the condition if(!fcontent || index == PAGE_SIZE) will be true only once. So I think check should be like index%PAGE_SIZE == 0 instead of index == PAGE_SIZE.

edited Aug 12 '11 at 14:05

Archie

6,391
4
36
44

answered Aug 12 '11 at 06:23

sudish

11
1

score 0 · Answer 6 · answered Aug 01 '10 at 07:37

0

On POSIX systems (e.g linux) you could get the same effect with the system call mmap that maps all your file in memory. It has an option to map that file copy on write, so you would overwrite your file if you change the buffer.

This would usually be much more efficient, since you leave as much as you can to the system. No need to do realloc or similar.

In particular, if you are only reading and several processes do that at the same time there would be only one copy in memory for the whole system.

answered Aug 01 '10 at 07:37

Jens Gustedt

76,821
6
102
177

I think you're confused about what copy-on-write means. If the file is mapped copy-on-write (private), the map is originally just a reference to the on-disk file, but any changes you make to it will result in a copy of the data that's local to your process. If it's mapped shared, then your changes will be written to the file and visible by other processes. – R.. GitHub STOP HELPING ICE Aug 01 '10 at 08:01
@R. a reference to the on-disk file? sure all `mmap` does that this is the idea of it. What I meant is that the system can hold all pages that you don't change in its page cache and share this cache between processes. This is true for two situations: (1) as long as you map things read-only or (2) if you use copy-on-write and you don't change the contents. So in general if you think that you need random access to the whole contents of a file, `mmap` is almost always the better strategy. `fread` and variants should be limited to cases you only need partial access to the file at a given time. – Jens Gustedt Aug 01 '10 at 12:07

Reading all content from a text file - C

6 Answers6

Linked