I have fairly large text files (~1 GB) containing sequential data that I wish to parse line by line, from top to bottom. These text files are compressed in the gzip format.

Currently, my basic implementation for parsing these files (I'm new to zlib, and haven't written C for many years) is:

  1. uncompress the file using the zlib library and write it to disk (!)
  2. read from disk (!) the decompressed text file and parse it line by line

Hopefully, this can be improved as soon as I understand how to better use zlib (tips appreciated ;-) ) by doing:

  1. uncompress the file using the zlib library and keep contents in memory
  2. read file (from memory) and parse it line by line

However, I think this could be optimized further by parsing the file "online", while it is being decompressed. I believe gzip decompression is essentially sequential, so it might be possible to read the gzip file and, as soon as a line of text has been decompressed, send it to the parser? This would avoid scanning the file twice and, possibly, also avoid keeping the whole decompressed file in memory.

Here is an answer that says it is possible and preferable to do it this way. Could you please show me how I could go about implementing (or using a lib that implements) such a program?

Thanks,

Tepp.

sg1234
  • I think the link you have provided is containing a code doing it... – Eugene Sh. Jan 29 '16 at 18:56
  • Thanks for your comment. I could be mistaken, but the only code I have found so far that does this is "shell" syntax using pipe. As for the C-code provided in one of the links, it is for another compression algorithm (not gzip.) – sg1234 Jan 29 '16 at 19:16
  • While I'm sure it could be done with zlib, I wouldn't hesitate to just run `gunzip < file.gz | myprogram`, or if for some reason that was inconvenient, then `ifp = popen("gunzip < file.gz", "r")`. And then have the rest of the program read lines of text from `stdin` or `ifp` as usual. (But any solution along these lines does indeed assume a Unix-like shell environment is available, with pipes.) – Steve Summit Jan 29 '16 at 19:30
  • Yes, you can definitely write a program which uses zlib to decompress a file chunk by chunk and pass those chunks as they accumulate to code to parse them. In fact, the most logical way might be to use zlib routines as the parser's function to get the next character(s). – John Hascall Jan 29 '16 at 19:42
  • @SteveSummit thanks for your comment, I was hoping to avoid piping and to get a simple solution like the one I accepted, which seems to work well so far :) – sg1234 Jan 29 '16 at 20:49
  • @JohnHascall, thanks that is exactly what I was looking for and the answer has since been provided and accepted below. – sg1234 Jan 29 '16 at 20:49

2 Answers

Yes. You don't even have to use popen() to do it; zlib includes a set of functions for doing exactly this:

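#include <stdio.h>
#include <zlib.h>

int main(void) {
    gzFile fh = gzopen("file.gz", "rb");
    if (fh == NULL) {
        perror("gzopen");
        return 1;
    }

    char buf[1024];
    char *line;
    /* gzgets returns buf on success and NULL at end of file or on error */
    while ((line = gzgets(fh, buf, sizeof(buf))) != NULL) {
        // process line
    }

    gzclose(fh);
    return 0;
}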

The same interface also supports writing gzip files a line at a time; see the documentation for details.

  • Thank you for your answer. The compiler complained about a couple of things so I changed the code a bit, here is what worked for me : ` gzFile fh = gzopen( "/home/file.txt.gz", "rb" ); char line[1024]; gzgets( fh, line, sizeof(line) ); while (line!=NULL) { //process line gzgets( fh, line, sizeof(line) ); } gzclose(fh); ` (sorry code is not showing correctly, despite the use of backticks) – sg1234 Jan 29 '16 at 20:39
  • basically : 1. changed gzFile pointer to a gzFile, and 2. let "line" be updated through passing it by ref instead of expecting it to be returned, which gave me the error "assignment to expression with array type". if this makes sense you may want to edit your answer accordingly ? – sg1234 Jan 29 '16 at 20:46
  • @teppyogi Yep; updated code to match. This is what happens when you write code in an answer without testing it :) You do need to check the return value of `gzgets()`, though; it'll return NULL when you hit the end of the file. –  Jan 29 '16 at 22:12
  • @chqrlie Whoops, you're right. Fixed! Again! –  Jan 29 '16 at 23:47
  • If `gzgets` return value is either its destination argument or `NULL`, it is useless to store the result into a separate `line` variable. – chqrlie Jan 30 '16 at 00:12
  • @chqrlie Eh, there's reasons. :) You can't pass `&buf` into `strsep()`, for instance. –  Jan 30 '16 at 01:21
  • @duskwuff: sure, but it is a tad confusing because the reader wonders if `line` can be different from `buf`. `gzgets`, like `fgets` returns its destination argument if successful, storing it to a pointer is unusual but not incorrect. – chqrlie Jan 30 '16 at 06:25

You can open a gzip-compressed file via `popen` and read from the stream sequentially as if it were uncompressed, except that you cannot seek in the stream.

Here is some code:

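#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    char buffer[4096];
    char *cmd;
    size_t cmdsize;
    FILE *fp;
    int found = 0;

    if (argc < 3) {
        fprintf(stderr, "usage: zgrep string file\n");
        return 2;
    }
    /* Build the shell command; the file name is passed to the shell unquoted. */
    cmdsize = strlen("gunzip < ") + strlen(argv[2]) + 1;
    cmd = malloc(cmdsize);
    if (cmd == NULL) {
        perror("malloc");
        return 1;
    }
    snprintf(cmd, cmdsize, "gunzip < %s", argv[2]);
    fp = popen(cmd, "r");
    free(cmd);
    if (fp == NULL) {
        perror("cannot run gunzip");
        return 1;
    }
    /* Print every line of the decompressed stream that contains the string. */
    while (fgets(buffer, sizeof buffer, fp)) {
        if (strstr(buffer, argv[1])) {
            fputs(buffer, stdout);
            found = 1;
        }
    }
    pclose(fp);  /* a stream opened with popen must be closed with pclose */
    return found;  /* exit status is 1 when a match was found, 0 otherwise */
}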
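
Because the command string is interpreted by the shell, a file name containing spaces or quotes would break it. One possible way around that, sketched here under the assumption of a POSIX shell, is to single-quote the name before building the command. The helpers `shell_quote` and `open_gunzip_pipe` are hypothetical names for this illustration, not part of any library.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'd copy of s wrapped in single quotes,
 * with each embedded ' rewritten as '\'' so a POSIX shell sees one word. */
char *shell_quote(const char *s)
{
    size_t n = 2;                        /* opening and closing quote */
    for (const char *p = s; *p; p++)
        n += (*p == '\'') ? 4 : 1;
    char *q = malloc(n + 1);
    if (q == NULL)
        return NULL;
    char *w = q;
    *w++ = '\'';
    for (const char *p = s; *p; p++) {
        if (*p == '\'') {
            memcpy(w, "'\\''", 4);       /* close quote, escaped ', reopen quote */
            w += 4;
        } else {
            *w++ = *p;
        }
    }
    *w++ = '\'';
    *w = '\0';
    return q;
}

/* Hypothetical helper: open a read pipe on "gunzip < file".
 * The caller closes the returned stream with pclose(). */
FILE *open_gunzip_pipe(const char *path)
{
    FILE *fp = NULL;
    char *quoted = shell_quote(path);
    if (quoted == NULL)
        return NULL;
    size_t size = strlen("gunzip < ") + strlen(quoted) + 1;
    char *cmd = malloc(size);
    if (cmd != NULL) {
        snprintf(cmd, size, "gunzip < %s", quoted);
        fp = popen(cmd, "r");
    }
    free(cmd);
    free(quoted);
    return fp;
}
```

With these helpers, the `popen` call in the program above could be replaced by `fp = open_gunzip_pipe(argv[2]);`, leaving the `fgets` loop and the `pclose` call unchanged.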
chqrlie
  • thanks for your comment, I was hoping to avoid piping (popen seems to be a form of piping if I understand your sample code correctly ?) and to get a simple solution like the one I accepted, which seems to work well so far... – sg1234 Jan 29 '16 at 20:51