4

I am trying to optimise the run length coding. I was thinking of implementing it in SIMD. I spent a few hours working on the algo but couldn't proceed much. Is it worth giving it a shot? I am working on Neon.

Thanks.

osgx
  • 90,338
  • 53
  • 357
  • 513
Rugger
  • 373
  • 3
  • 10
  • 2
    Please give more details about your case. What is compressed, how your rle is encoded. May be add some example code (both w/o simd and with)? What is your CPU? What is your current speed (in MB/s) or latency for typical sized input and what is the goal. – osgx Feb 16 '13 at 13:15
  • 2
    I am trying to optimize the run length coding part of JPEG compression.The major portion of the time taken by the JPEG encoding is this RLE. I am working on ARM platform. It has NEON simd engine. – Rugger Feb 16 '13 at 14:25
  • yes.I am working on Cortex A9 – Rugger Feb 16 '13 at 19:30
  • What is your JPEG implementation? – osgx Feb 17 '13 at 13:32

1 Answers1

7

I'll assume that your issue is with RLE in AC block coefficients encoding in JPEG.

Variant of RLE, used in JPEG is quite specific. Each block of 8x8 pixels is transformed with DCT and quantised. Then DCT output (64 coefficients) is separated for first DC coefficient and 63 AC coefficients. 8x8 block of AC coefs is converted to linear array using Zig-Zag pattern, and then RLE-encoded

ZigZag from wikimedia (File:JPEG_ZigZag.svg, by Alex Khristov, PD)

ZigZag is usually combined with RLE run, so we have RLE with non-linear memory access and should optimize both functions.

WARNING! RLE in JPEG is used only for zero elements!, check F.1.2.2.1, F.1.2.2.3 and Figure F.2 of ISO/IEC 10918-1 : 1993 standard.

Some implementations:

There is libjpeg-tubro (Homepage is libjpeg-turbo.org), and some fork of it was optimized by Linaro for ARM in 2010-2011.

There is list of optimizations in this fork:

  • 10 - ARM NEON optimizations for quantization code in 'forward_DCT' (Integer divisions replaced with floating point multiplications. ... This optimized function is now almost 3x faster than original C variant - jcdctmgr.c - Siarhei Siamashka 2010-11-10
  • 9 - ARM assembly optimizations for 'encode_one_block' (ARM assembly optimizations for 'encode_one_block'. Almost 2x faster than original C variant.) - jchuff.c - Siarhei Siamashka 2010-11-10
  • 8 - ARM NEON optimized version of 'rgb_ycc_convert' - Siarhei Siamashka 2010-11-10
  • 7 - ARM NEON optimized version of 'ycc_rgb_convert' - Siarhei Siamashka 2010-11-10
  • 6 - A minor ARM NEON optimization for 'convsamp' - Siarhei Siamashka 2010-11-10
  • 5 - ARM NEON optimized version of 'jpeg_fdct_ifast' - Siarhei Siamashka 2010-11-10
  • 4 - ARM NEON optimized version of 'jpeg_idct_ifast' - Siarhei Siamashka 2010-11-10
  • 3 - ARM NEON optimized version of 'jpeg_idct_4x4' - Siarhei Siamashka 2010-11-10

Revision 9 was ARM optimization of jchuff.c file, for function encode_one_block, which encodes both AC and DC components of block; but it is done without using NEON

/* Encode a single block's worth of coefficients */
LOCAL(boolean)   encode_one_block 

Factually, RLE was not optimized; but ZigZag and last zero coef detection was implemented in ARM assembly in the find_last_nonzero_index helper function. (It was generated using ruby generator) or in linaro git. It was scheduled for dual-issue Cortex-A8:

* Find last nonzero coefficient and produce output in natural order,
* instructions are scheduled to make use of ARM Cortex-A8 dual-issue
* capability

This is the corresponding C code of function:

LOCAL(int)
find_last_nonzero_index (JCOEFPTR block, JCOEFPTR out)
{
  int tmp, i, n = 0;
  for (i = 1; i < DCTSIZE2; i++) {
    if ((tmp = block[jpeg_natural_order[i]]) != 0)
      n = i;
    out[i] = tmp;
}
return n;

There is ARM or NEON optimizations for RLE itself here, but I think this asm code helps RLE, converting it to linear memory access version (encode_one_block, under ifdef __arm__):

  for (k = 2; k <= last_nonzero_index; k += 2) {
      innerloop(k);
  }

r used in innerloop macro is the RLE counter.

This linear RLE with hand-coded ZigZag (for Cortex-A8) should be faster than original C version even for Cortex-A9. Thanks to Siarhei Siamashka!

PS: Current version of libjpeg-turbo has no arm optimizations in this RLE: encode_one_block, Line 454 of jchuff.c, rev929 - innerloop was slightly rewritten as kloop, but RLE is still done in non-linear way; zigzag not separated from it.

Some thoughts about NEON

NEON has set of 32 registers each 64-bit wide (D0..D31; according to Anderson @ ELC 2011, page 5) and, in theory, can be used to store 64 of 16-bit coefficients and to implement ZigZag+RLE. Still searching for implementations...

There is similar zig-zag + RLE in MPEG standards, and there are some efforts of SIMD implementation of it for x86 and arm. There is a blog post in x264dev; 8x8 zigzag x264_zigzag_scan_8x8_frame was implemented for x86 with SSSE3. But for ARM there is only 4x4 NEON zigzag in x264.

PS: Just for me and anybody who don't know Jpeg internals. There is short and understandable introduction to JPEG coding in Cardiff CM0340 lecture slides.

PPS (updated at February 18, 2013, 13:30): For optimizing RLE encoding, we can do a pre-scan searching for zeroes in middle of AC coefficients and then work with this precalculated data. We can even save it as half-bytes or as bits in some NEON regs

UPDATE February 18: Patch author says that comment for commit 9 is inaccurate. There was 2x improvement when this code was compared with jpeg6b, not libjpeg-turbo. He says that unroll by 63 (as in libjpeg-tubro) has almost same speed as this asm solution (on some tests it is slightly better on other not).

osgx
  • 90,338
  • 53
  • 357
  • 513
  • 1
    Thanks for your reply. I had already optimized the DCT, Quant and zigzag module in Neon and I was wondering if i could do something for the RLE part. I got few tips while I went through the opensource code. – Rugger Feb 18 '13 at 18:16
  • Is your work open-source? Are you started from libjpeg or libjpeg-turbo? What are the tips you got? – osgx Feb 18 '13 at 18:27
  • The last non zero function and zig zag scan can be implemented in NEON. But the real problem is going beoynd that. Whether the run can be parallely calculated in neon, so that the AC size and run encoding can be optimized. But, I feel its not possible in NEON, atleast the run part.So yeah., I am kinda stuck there. Do u have any tips on how to proceed on calculating runs in a SIMD way. Calculating runs can be done in some way but the most difficult part is to store neglect the zero coefficients and store only the runs and value of non zero coefficients using NEON. – Rugger Mar 02 '13 at 16:05
  • Rugger, can you publish your current code? You can add it to the question as update. – osgx Mar 02 '13 at 17:14