How to parse an entropy-coded JPEG block efficiently?

Question

I'm just trying to jump through a SOS_MT block in a .JPEG file, I don't want to use the data for anything, I just want to know where it ends. According to what I understand from JPEG's article in Wikipedia, while all other blocks in the JPEG file start with a few bytes that indicate the blocks's length, a SOS_MT block is ... well, an evil swamp that you have no option but to parse byte-by-byte until you get to the end of it.

So I came with the following code to do just that:

entropyCoded :: Parser Int
entropyCoded = do
    list_of_lengths <-  many' $
         (
           do
             _ <- notWord8 0xFF
             return 1
         )
         <|>
         (
           do
             _ <- word8 0xFF
             _ <- word8 0
             return 2
         )
         <|>
         (
           do
             l <- many1 (word8 0xFF)
             _ <- satisfy (\x -> ( x >= 0xD0 && x < 0xD7 ))
             return $ 1 + length l
         )
         <|>
         (
           do
             _ <- word8 0xFF
             maybe_ff <- peekWord8'
             if maybe_ff == 0xFF
               then
                 return 1
               else
                 fail "notthere"
         )
    foldM (\ nn n -> nn `seq` return (nn + n) ) 0 list_of_lengths

This code uses Atoparsec and as far as I have had the chance to verify it, it is correct. It is just slow. Any tips on how to improve, performance-wise, this parser?

The compressed image data is sandwiched between FFDA and FFD9. The problem is that the FFD9 may be missing and the file should end there anyway. I don't understand your purpose in "jumping 'through' a SOS_MT' block. There is no 'through' and nothing 'on the other side'. — BitBank, Feb 26 '16 at 08:13
Firstly, (potentially) by using a library designed for binary serialization (I'm partial to [cereal](https://hackage.haskell.org/package/cereal)). Secondly, each branch parses a word unconditionally. Instead of parsing that word, and backtracking if you fail later, you should parse that word and decide what you are going to do next based on its value. Lastly (not certain about this one) instead of computing and summing the length, use [`match`](https://hackage.haskell.org/package/attoparsec-0.13.0.1/docs/Data-Attoparsec-ByteString.html) to get the parsed string, them simply compute its length. — user2407038, Feb 26 '16 at 08:15
@BitBank Forgive me if I didn't understand correctly, but what about multiple passes for progressive .JPEG? Don't each pass uses a different SOS_MT block? — dsign, Feb 26 '16 at 08:21
Thanks @user2407038, your suggestion about factoring out the word parsing is good. — dsign, Feb 26 '16 at 08:23
For progressive images, there's a SOS (FFDA) at the start of each scan. This byte sequence (FFDA) can only be interpreted as an SOS marker; compressed data with a sequence of FFDA would be encoded as FF00DA. — BitBank, Feb 26 '16 at 09:20

score 3 · Accepted Answer · answered Feb 26 '16 at 20:34

If you want to skip over an SOS market, just look for the next marker that is not a restart marker.

Read bytes until you find and FF. If the next value 00, it is a compressed FF value and skip over it. If it's a restart marker skip over it. Otherwise, FF should start the next block.

score 1 · Answer 2 · answered Nov 17 '18 at 05:10

a small addition to the previous answer, according to the ISO/IEC 10918-1 : 1993(E) standard:

B.1.1.5 Entropy-coded data segments

An entropy-coded data segment contains the output of an entropy-coding procedure. It consists of an integer number of bytes, whether the entropy-coding procedure used is Huffman or arithmetic.

NOTE 1

Making entropy-coded segments an integer number of bytes is performed as follows: for Huffman coding, 1-bits are used, if necessary, to pad the end of the compressed data to complete the final byte of a segment. For arithmetic coding, byte alignment is performed in the procedure which terminates the entropy-coded segment (see D.1.8).

NOTE 2

In order to ensure that a marker does not occur within an entropy-coded segment, any X'FF' byte generated by either a Huffman or arithmetic encoder, or an X'FF' byte that was generated by the padding of 1-bits described in NOTE 1 above, is followed by a "stuffed" zero byte (see D.1.6 and F.1.2.3).

so, when you enconter 0xFF in the entropy-encoded part at position N, read forward one more byte. if the next byte is 0x00 then it's a "stuffed" zero. if it's another 0xFF then there is a padding, re-check from N+1. every other byte (0x01-0xFE) is part of the next marker.

How to parse an entropy-coded JPEG block efficiently?

2 Answers2