
My objective is to convert a 32-bit bitmap (BGRA) buffer into a PNG image in real time using C/C++. To achieve this, I used the libpng library to convert the bitmap buffer and write it into a PNG file. However, it takes a huge amount of time (~5 seconds) to execute on the target ARM board (quad-core processor) in a single thread. On profiling, I found that libpng's compression step (the deflate algorithm) takes more than 90% of the time, so I tried to reduce it through parallelization. The end goal is to get it done in less than 0.5 seconds.

Now, since a PNG can have multiple IDAT chunks, I thought of writing the IDATs in parallel. To write a custom PNG file with multiple IDATs, the following methodology is adopted:

   1. Write the PNG IHDR chunk
   2. Write the IDAT chunks in parallel
      i.   Split the input buffer into 4 parts.
      ii.  Compress each part in parallel using the zlib "compress" function.
      iii. Compute the CRC of the chunk { "IDAT" + zlib-compressed data }.
      iv.  Create the IDAT chunk, i.e. { "IDAT" + zlib-compressed data + CRC }.
      v.   Write the length of the IDAT chunk created.
      vi.  Write the complete chunks in sequence.
   3. Write the IEND chunk

Now the problem is that the PNG file created by this method is invalid or corrupted. Can somebody point out:

  1. What am I doing wrong?
  2. Is there any fast implementation of zlib compress, or of multi-threaded PNG creation, preferably in C/C++?
  3. Any other alternative way to achieve the target?

Note: the PNG specification is followed in creating the chunks.

Update: the following method works for creating the IDATs in parallel:

    1. Add one filter byte before each row of the input image.
    2. Split the image into four equal parts. <-- may not be required; passing a pointer into the buffer plus offsets works
    3. Compress the image parts in parallel
            (A) for the first image part
                --deflateInit(zstrm, Z_BEST_SPEED)
                --deflate(zstrm, Z_FULL_FLUSH)
                --deflateEnd(zstrm)
                --store the compressed buffer and its length
                --store adler32 for the current chunk, {a1=zstrm->adler} <--adler is of the uncompressed data
            (B) for the second and third image parts
                --deflateInit(zstrm, Z_BEST_SPEED)
                --deflate(zstrm, Z_FULL_FLUSH)
                --deflateEnd(zstrm)
                --store the compressed buffer and its length
                --strip the first 2 bytes (zlib header), reduce length by 2
                --store adler32 for the current chunk, zstrm->adler, {a2, a3 similar to A} <--adler is of the uncompressed data
            (C) for the last image part
                --deflateInit(zstrm, Z_BEST_SPEED)
                --deflate(zstrm, Z_FINISH)
                --deflateEnd(zstrm)
                --store the compressed buffer and its length
                --strip the first 2 bytes (zlib header) and the last 4 bytes of the buffer, reduce length by 6
                --here the last 4 bytes should equal zstrm->adler, {a4=zstrm->adler} <--adler is of the uncompressed data

    4. adler32_combine() all four parts, i.e. a1, a2, a3 & a4 <--last arg is the length of the uncompressed data used to calculate the adler32 of the 2nd arg
    5. Store the total length of the compressed buffers <--used in calculating the CRC of the complete IDAT & written before the IDAT in the file
    6. Append "IDAT" to Final chunk
    7. Append all four compressed parts in sequence to Final chunk
    8. Append adler32 checksum computed in step 4 to Final chunk
    9. Append CRC of Final chunk i.e.{"IDAT"+data+adler}

    To be written in png file in this manner: [PNG_HEADER][PNG_DATA][PNG_END]
    where [PNG_DATA] ->Length(4-bytes)+{"IDAT"(4-bytes)+data+adler(4-bytes)}+CRC(4-bytes)
Prashant Ranjan
  • possible duplicate of [Parallelization of PNG file creation with C++, libpng and OpenMP](http://stackoverflow.com/questions/10827247/parallelization-of-png-file-creation-with-c-libpng-and-openmp) – timrau Mar 12 '14 at 15:29
  • 1
    @timrau i have seen the post mentioned earlier. In that post the author has implemented compress and created only single IDAT chunk in png file, while in my case i am trying to prallelize and write multiple IDATs. So i want to know what is the correct way of writing png file with multiple IDATs in parallel? – Prashant Ranjan Mar 12 '14 at 15:37
  • Comments on steps: You don't need step 3, since those are already computed in each thread, and are the three sets of four bytes you are stripping off the end. Just don't discard those. Then the current step 4 would be moved after the current step 5. – Mark Adler Mar 13 '14 at 17:43
  • You should show your code for `deflateInit`, `deflate`, `deflateEnd`. – Mark Adler Mar 13 '14 at 17:46
  • I don't understand what exactly you mean in steps 6-9. Also note that you need a chunk CRC. – Mark Adler Mar 13 '14 at 17:48
  • I have not shared source because of restrictions, however you can check my [partial source](http://pastebin.com/0svMZrT5). Thanks. – Prashant Ranjan Mar 14 '14 at 04:32
  • @Mark I have changed my implementation to store the last 4 bytes of the zlib chunks in all 4 threads and then combine using adler32_combine(). It seems that the adler32 of the uncompressed data is not the same as the adler checksum computed. Please check this: [0] Last 4 Bytes 0 0 255 255 adler32_combine before: 1 after: 14 , [1] Last 4 Bytes 0 0 255 255 before: 14 after: 3407899, [2] Last 4 Bytes 0 0 255 255 before: 3407899 after: 10223656, [3] Last 4 Bytes 73 246 46 178 before: 10223656 after: 1261317849, Original adler32: 1335221251 Calculated: 1261317849. Is it right? – Prashant Ranjan Mar 14 '14 at 10:45
  • Ah, my mistake. When you do a full flush, the last four bytes are _not_ the Adler-32. Do not strip the last four bytes of the full flushed streams, only the finished stream. The Adler-32 is still being computed for the full flush streams, and it can be retrieved from `strm->adler`. – Mark Adler Mar 14 '14 at 16:06
  • Yes, I realised it and modified my code; finally it is working fine. Thanks a lot for your precious time. However, I am having another problem: the compressed PNG image, on decoding, is not appearing correctly. The compressed data for 2048*2048*4 = 16777216 bytes is 8671723 bytes; after decompression it comes to 16804499, i.e. I am getting 27283 extra bytes. What could be the reason? I am using libpng's `png_read_image()` to read the compressed PNG image. – Prashant Ranjan Mar 15 '14 at 06:35

2 Answers


Even when there are multiple IDAT chunks in a PNG datastream, they still contain a single zlib compressed datastream. The first two bytes of the first IDAT are the zlib header, and the final four bytes of the final IDAT are the zlib Adler-32 checksum, which is computed over the uncompressed data (everything fed to the compressor), not over the compressed bytes.

There is a parallel gzip (pigz) under development at zlib.net/pigz. It will generate zlib datastreams instead of gzip datastreams when invoked as "pigz -z".

For that you won't need to split up your input file because the parallel compression happens internally to pigz.

Glenn Randers-Pehrson
  • Thanks for replying, Glenn. As I understand it, the compressed zlib datastream has a 2-byte header and a 4-byte trailer. If I strip both of them while compressing in parallel and at last combine all of the parts, then add the 2-byte zlib header and calculate & append the 4-byte adler32 checksum manually, will it be a valid PNG? – Prashant Ranjan Mar 13 '14 at 03:00
  • Yes, that's how I understand it. Strip the 4-byte trailer from the first segment. Strip both the 2-byte header and the 4-byte trailer from the rest of the segments. Compute the adler32 checksum over your original complete datastream (probably during the splitting pass). Append that 4-byte checksum to the last segment. Then start each segment with length and "IDAT", and a crc32 checksum at the end of each segment. Poke around the source files in my "pngzop" project at SourceForge (a subdirectory of the "pmt" project), especially the pngzop_zlib_to_idat.c program that reassembles the IDAT. – Glenn Randers-Pehrson Mar 13 '14 at 03:34
  • If I understand you correctly, the adler32 checksum does not require the zlib-compressed data; it can be computed earlier for the complete input buffer and appended to the last IDAT segment directly? – Prashant Ranjan Mar 13 '14 at 04:54
  • No, you can't take separately created deflate streams and concatenate them to make a single deflate stream. The first one has the last bit set on the last block, which ends the decompression. – Mark Adler Mar 13 '14 at 05:00

In your step ii, you need to use deflate(), not compress(). Use Z_FULL_FLUSH on the first three parts, and Z_FINISH on the last part. Then you can concatenate them into a single stream, after pulling the two-byte header off the last three (keep the header on the first one), and pulling the four-byte check value off the last one. For all of them, you can get the check value from strm->adler. Save those.

Use adler32_combine() to combine the four check values you saved into a single check value for the complete input. You can then tack that on to the end of the stream.

And there you have it.

Mark Adler
  • 1. Do you mean that I cannot write separately zlib-compressed data chunks into separate IDATs? 2. Is the last 4-byte adler checksum of the {zlib 2-byte header + data} or only of the {zlib data} part? 3. If, say, I have 2 adler sums s1 & s2, will adler32_combine() return the same final value if I pass s1 before s2 and vice versa? 4. If I pass Z_FULL_FLUSH or Z_FINISH to deflate, do the resulting compressed buffers differ by one bit only? Also, will the adler checksum be the same in both cases? – Prashant Ranjan Mar 13 '14 at 06:16
  • 1. Yes. As Glenn noted, the separate IDATs combined are a single zlib stream. – Mark Adler Mar 13 '14 at 06:19
  • 2. Neither. The checksum is of the _uncompressed_ data. – Mark Adler Mar 13 '14 at 06:19
  • 3. I don't get what you mean by vice-versa. You can compute `a = adler32(A)`, `b = adler32(B)`, then `adler32_combine(a, b, len(B))` will give the same thing as `adler32(AB)`. Here `A` and `B` are sequences of bytes, and `AB` is their concatenation. – Mark Adler Mar 13 '14 at 06:22
  • 4a. No. The `Z_FULL_FLUSH` will write an extra empty block at the end to bring the deflate stream to a byte boundary. – Mark Adler Mar 13 '14 at 06:23
  • 4b. Yes. The Adler-32 is of the uncompressed data. – Mark Adler Mar 13 '14 at 06:23
  • Thanks a lot for clarifying my doubts, Mark. I had the misconception that the checksum is of the compressed data; that is why I asked so many similar questions. I will try to compress as per your suggestion. – Prashant Ranjan Mar 13 '14 at 06:34
  • Note that compressing parts of the stream separately will undoubtedly result in a larger output, because the second and subsequent pieces are compressed without the dictionary built from the previous input. It might be possible to mitigate that by using part 1 (uncompressed) as a "preset dictionary" for part 2, etc., but the bookkeeping would be pretty intense. – Glenn Randers-Pehrson Mar 13 '14 at 12:51
  • Thanks Glenn -- that is an important point. Once Prashant gets what has been described working, then it would be time to graduate to using `deflateSetDictionary()` to improve the compression. It's not terribly complicated. You just need to feed the last 32K of part _n_ as the preset dictionary of part _n+1_. That is what pigz does. – Mark Adler Mar 13 '14 at 13:53
  • @Mark I did what you suggested. However, when I tried to read the same image using libpng, it threw the error "invalid stored block lengths". – Prashant Ranjan Mar 13 '14 at 15:33
  • Provide more details of what you did in the question. – Mark Adler Mar 13 '14 at 15:40
  • @Glenn I am not worried much about image size right now, because for my current set of test images I have not observed too much difference in size. My test image is of size 2048x2048x4 --> compression in a single thread gives a size of around 8.3 MB, while compressing in four parallel threads gives an almost identical file size (diff is ~200-400 KB). I have turned off filtering currently and am using Z_BEST_SPEED as the level. – Prashant Ranjan Mar 13 '14 at 15:43
  • @Mark I have updated the original post. Please take a look and let me know if something is missing or wrong. – Prashant Ranjan Mar 13 '14 at 17:24