I have fairly large text files (~1 GB) containing sequential data that I wish to parse (i.e. the lines are read in order, from top to bottom). These text files are gzip-compressed.
Currently, my basic approach to parsing these files (I'm new to zlib and haven't written C for many years) is:
- uncompress the file using the zlib library and write it to disk (!)
- read the decompressed text file back from disk (!) and parse it line by line
Hopefully, once I understand how to use zlib better (tips appreciated ;-) ), this can be improved to:
- uncompress the file using the zlib library and keep the contents in memory
- read the file (from memory) and parse it line by line
However, I think this could be optimized further by parsing the file "online", i.e. while it is being decompressed. I believe gzip decompression is essentially sequential, so it might be possible to read the gzip file and, as soon as a line of text has been decompressed, send it to the parser? This would avoid scanning the data twice and, possibly, also avoid keeping the whole decompressed file in memory.
Here is an answer that says it is possible and preferable to do it this way. Could you please show me how to implement such a program (or point me to a library that already implements it)?
Thanks,
Tepp.