I've recently been playing around with CloudFlare's optimized zlib, and the results are really quite impressive.
Unfortunately, they seem to have assumed development of zlib was abandoned, and their fork broke away. I was eventually able to manually rebase their changes on to the current zlib development branch, though it was a real pain in the ass.
Anyway, there's still one major optimization in the CloudFlare code I haven't been able to utilize, namely, the fast CRC32 code implemented with the PCLMULQDQ carry-less multiplication instructions included with newer (Haswell and later, I believe) Intel processors, because:
I'm on a Mac, and neither the clang integrated assembler nor Apple's ancient GAS understand the newer GAS mnemonics used, and
The code was lifted from the Linux kernel and is GPL2, which makes the whole library GPL2, and thereby basically renders it useless for my purposes.
So I did some hunting around, and after a few hours I stumbled onto some code that Apple is using in their bzip2: handwritten, vectorized CRC32 implementations for both arm64 and x86_64.
Bizarrely, the comments for the x86_64 assembly are (only) in the arm64 source, but it does seem to indicate that this code could be used with zlib:
This function SHOULD NOT be called directly. It should be called in a wrapper
function (such as crc32_little in crc32.c) that 1st align an input buffer to 16-byte (update crc along the way),
and make sure that len is at least 16 and SHOULD be a multiple of 16.
But I unfortunately, after a few attempts, at this point I seem to be in a bit over my head. And I'm not sure how to actually do that. So I was hoping someone could show me how/where one would call the function provided.
(It also would be fantastic if there were a way to do it where the necessary features were detected at runtime, and could fall back to the software implementation if the hardware features are unavailable, so I wouldn't have to distribute multiple binaries. But, at the very least, if anyone could help me suss out how to get the library to correctly use the Apple PCLMULQDQ-based CRC32, that would go a long way, regardless.)