2

I am looking for the fastest way to read a sequential file from disk. I read in some posts that if I compress the file with, for example, LZ4, I can achieve better performance than reading the flat file, because I will minimize the I/O operations.

But when I try this approach, scanning an LZ4-compressed file gives me poorer performance than scanning the flat file. I haven't tried the lz4demo linked below, but looking at it, my code is very similar.

I have found these benchmarks: http://skipperkongen.dk/2012/02/28/uncompressed-versus-compressed-read/ http://code.google.com/p/lz4/source/browse/trunk/lz4demo.c?r=75

Is it really possible to improve performance reading a compressed sequential file over an uncompressed one? What am I doing wrong?

bartolo-otrit
p.magalhaes
  • It all kinda depends on what hardware you have, your design and what your overall intent is. Do you have an ancient, slow spinner, a network drive or a PCIe SSD? Do you need to start processing data with the minimum latency, or do you want the overall file operation to be completed in the minimum time? Do you have a monster 32-core server with 128GB RAM or a cheap laptop? Does your software design allow the reading of one large buffer while another is being concurrently processed? All these factors... best to just try it and see if it's faster. – Martin James Nov 05 '13 at 12:38
  • 7200 rpm drive. The overall file operation should be completed in the minimum time. Cheap laptop. Not yet. In this example, the operations are synchronous: read the compressed buffer, decompress it, process it. – p.magalhaes Nov 05 '13 at 12:45
  • I compiled the lz4demo and used it. Same result as my implementation. – p.magalhaes Nov 05 '13 at 13:34

2 Answers

2

Yes, it is possible to improve disk read performance by using compression.

This effect is most likely to happen if you use a multi-threaded reader: while one thread reads compressed data from disk, the other decodes the previous compressed block in memory.

Considering the speed of LZ4, the decoding operation is likely to finish before the other thread completes reading the next block. This way, you'll achieve a bandwidth improvement proportional to the compression ratio of the tested file.
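
A minimal sketch of that pipeline, assuming the file was written as size-prefixed LZ4 blocks (a 4-byte compressed length before each block; this framing, the block size, and the file name are illustrative assumptions, not the official LZ4 frame format):

    /*
     * Sketch of the two-thread idea above: one thread reads size-prefixed LZ4
     * blocks from disk while the main thread decompresses the previous block.
     * Error handling is kept minimal for brevity.
     * Build (assuming liblz4 is installed): gcc -O2 pipelined_read.c -llz4 -lpthread
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <pthread.h>
    #include <lz4.h>

    #define MAX_BLOCK   (4 * 1024 * 1024)   /* max uncompressed block size (assumed) */
    #define QUEUE_DEPTH 2                   /* double buffering */

    typedef struct {
        char *data;   /* compressed bytes */
        int   size;   /* compressed size; 0 marks end of stream */
    } block_t;

    /* Tiny single-producer/single-consumer queue built on a mutex + condvars. */
    static block_t queue[QUEUE_DEPTH];
    static int q_head = 0, q_tail = 0, q_count = 0;
    static pthread_mutex_t q_lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  q_not_full  = PTHREAD_COND_INITIALIZER;

    static void push(block_t b) {
        pthread_mutex_lock(&q_lock);
        while (q_count == QUEUE_DEPTH) pthread_cond_wait(&q_not_full, &q_lock);
        queue[q_tail] = b;
        q_tail = (q_tail + 1) % QUEUE_DEPTH;
        q_count++;
        pthread_cond_signal(&q_not_empty);
        pthread_mutex_unlock(&q_lock);
    }

    static block_t pop(void) {
        pthread_mutex_lock(&q_lock);
        while (q_count == 0) pthread_cond_wait(&q_not_empty, &q_lock);
        block_t b = queue[q_head];
        q_head = (q_head + 1) % QUEUE_DEPTH;
        q_count--;
        pthread_cond_signal(&q_not_full);
        pthread_mutex_unlock(&q_lock);
        return b;
    }

    /* Reader thread: pulls compressed blocks off disk as fast as it can. */
    static void *reader(void *arg) {
        FILE *in = (FILE *)arg;
        uint32_t csize;
        while (fread(&csize, sizeof csize, 1, in) == 1 && csize > 0) {
            block_t b = { malloc(csize), (int)csize };
            if (!b.data || fread(b.data, 1, csize, in) != csize) { free(b.data); break; }
            push(b);
        }
        block_t eos = { NULL, 0 };   /* end-of-stream marker */
        push(eos);
        return NULL;
    }

    int main(int argc, char **argv) {
        FILE *in = fopen(argc > 1 ? argv[1] : "data.lz4blocks", "rb");
        if (!in) { perror("fopen"); return 1; }

        pthread_t t;
        pthread_create(&t, NULL, reader, in);

        /* Consumer: decompress each block while the reader is already
         * fetching the next one from disk. */
        char *plain = malloc(MAX_BLOCK);
        long long total = 0;
        for (;;) {
            block_t b = pop();
            if (b.size == 0) break;
            int n = LZ4_decompress_safe(b.data, plain, b.size, MAX_BLOCK);
            if (n < 0) fprintf(stderr, "corrupt block\n");
            else total += n;   /* ...process 'plain' here... */
            free(b.data);
        }
        pthread_join(t, NULL);
        fclose(in); free(plain);
        printf("decompressed %lld bytes\n", total);
        return 0;
    }

With a queue depth of 2 the reader stays at most one block ahead of the consumer, which is already enough to overlap disk I/O with decompression.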

Obviously, there are other effects to consider when benchmarking. For example, seek times on an HDD are several orders of magnitude larger than on an SSD, and under bad circumstances they can become the dominant part of the timing, reducing any bandwidth advantage to zero.

Cyan
0

It depends on the speed of the disk vs. the speed and space savings of decompression. I'm sure you can put this into a formula.
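
One way to put it into a formula (my notation; assuming reading and decompression are not overlapped and ignoring seek time): let S be the uncompressed size, r the compression ratio (compressed size / uncompressed size), B_disk the sequential disk bandwidth, and B_dec the decompression bandwidth. Reading the compressed file wins when

    \frac{r S}{B_{disk}} + \frac{S}{B_{dec}} \;<\; \frac{S}{B_{disk}}
    \quad\Longleftrightarrow\quad
    B_{dec} \;>\; \frac{B_{disk}}{1 - r}

So with LZ4-class decompression speeds (well over 1 GB/s per core) and a laptop HDD on the order of 100 MB/s, the compressed read should win even on a fully synchronous path, whereas on a fast SSD the inequality can easily flip.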

Is it really possible to improve performance reading a compressed sequential file over an uncompressed one? What am I doing wrong?

Yes, it is possible (example: a 1 KB ZIP file could contain 1 GB of data - it would most likely be faster to read and decompress the ZIP).

Benchmark different algorithms and their decompression speeds. There are compression benchmark websites for that. There are also special-purpose high-speed compression algorithms.

You could also try to change the data format itself. Maybe switch to protobuf, which might be faster and smaller than CSV.

usr
  • Thanks for the reply. I am using a fixed-length text file. I will run a profiler on my code to find bottlenecks. – p.magalhaes Nov 05 '13 at 15:53
  • You can also try to parallelize the CPU work across cores by writing the file in fixed-size segments that you can individually decompress. – usr Nov 05 '13 at 16:06
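
A sketch of what that segmented layout could look like on the write side (the segment size, the length-prefix framing, and the file names are my assumptions, not a standard format; it pairs with the size-prefixed blocks used in the reader sketch in the first answer):

    /*
     * Sketch of the fixed-size-segment writer suggested above: each segment is
     * compressed independently and prefixed with its compressed length, so a
     * reader can hand whole segments to different cores.
     * Build: gcc -O2 segmented_write.c -llz4
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <lz4.h>

    #define SEGMENT (1 << 20)   /* 1 MiB of uncompressed data per segment */

    int main(int argc, char **argv) {
        FILE *in  = fopen(argc > 1 ? argv[1] : "input.dat", "rb");
        FILE *out = fopen(argc > 2 ? argv[2] : "data.lz4blocks", "wb");
        if (!in || !out) { perror("fopen"); return 1; }

        char  *plain = malloc(SEGMENT);
        char  *comp  = malloc(LZ4_compressBound(SEGMENT));
        size_t n;
        while ((n = fread(plain, 1, SEGMENT, in)) > 0) {
            /* Compress this segment on its own, so it can later be decompressed
             * without knowing anything about its neighbours. */
            int csize = LZ4_compress_default(plain, comp, (int)n, LZ4_compressBound(SEGMENT));
            if (csize <= 0) { fprintf(stderr, "compression failed\n"); break; }
            uint32_t len = (uint32_t)csize;
            fwrite(&len, sizeof len, 1, out);
            fwrite(comp, 1, len, out);
        }
        uint32_t eos = 0;   /* zero length marks end of stream */
        fwrite(&eos, sizeof eos, 1, out);

        free(plain); free(comp);
        fclose(in); fclose(out);
        return 0;
    }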