
So I am playing a bit with DCT implementations and noticed they are (relatively) slow due to the necessary multiplication operations.

After googling a bit, I came across BinDCT, which gives very good approximations of the DCT using only bit shifts and additions.

While scanning a paper about it (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.7.834&rep=rep1&type=pdf and http://www.docstoc.com/docs/130118150/Image-Compression-Using-BinDCT) and reading some code I found on Ohloh (http://code.ohloh.net/file?fid=vz-HijUWVLFS65NRaGZpLZwZFq8&cid=mt_ZjvIU0Us&s=&fp=461906&projSelected=true#L0), I noticed there are only implementations for an 8x8 matrix.

I am looking for an implementation of this BinDCT for a 32x32 matrix so I can use it in a faster variation of the perceptual hash algorithm (phash).

I am no mathematician, and although I tried to understand what's going on in the paper and the C code I found, I just can't wrap my head around how to transform this implementation so it applies to a 32x32 matrix.

Has anyone ever written one? Is it even possible?

I understand that extending the implementation requires a lot more bit shifting and temporary variables. I could try trial and error, but since I don't understand the theory, I would never know whether I got the correct result.

I am writing this in C#, but any language would suffice as it's all basic operations and can be easily translated.

Remco Ros

2 Answers


1. You have a fixed input size

  • so you multiply by the same weights all the time
  • pre-compute them once and then reuse them
  • this gets rid of all the sin/cos calls (see the sketch below)
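For reference, a minimal C# sketch of what this precomputation could look like, assuming a fixed N = 32 and a DCT-II (the class and member names here are mine, not from the linked code):

    using System;

    static class DctTables
    {
        public const int N = 32;

        // Cosine weights for a length-N DCT-II, computed once at startup:
        // CosTable[k, x] = cos((2x + 1) * k * PI / (2N))
        public static readonly double[,] CosTable = BuildCosTable(N);

        static double[,] BuildCosTable(int n)
        {
            var t = new double[n, n];
            for (int k = 0; k < n; k++)        // frequency index
                for (int x = 0; x < n; x++)    // sample index
                    t[k, x] = Math.Cos(((2 * x + 1) * k * Math.PI) / (2.0 * n));
            return t;
        }
    }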

2. The 2D DCT can be computed as 1D DCTs (similar to the FFT)

  • first do the DCT on the rows
  • then on the columns of the DCT'ed rows
  • multiply by the normalization constant
  • this converts O(N^4) into O(N^3) (see the sketch below)
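A sketch of the row/column approach in C#, reusing the CosTable above; the Alpha normalization assumes an orthonormal DCT-II, so adjust it if your reference implementation uses a different convention:

    static class SeparableDct
    {
        // Separable 2D DCT: a 1D DCT on every row, then a 1D DCT on every
        // column of the row-transformed data. O(N^3) instead of the naive O(N^4).
        public static double[,] Forward(double[,] input)
        {
            int n = input.GetLength(0);
            var rows = new double[n, n];
            var output = new double[n, n];

            for (int r = 0; r < n; r++)            // DCT along each row
                for (int k = 0; k < n; k++)
                {
                    double sum = 0;
                    for (int x = 0; x < n; x++)
                        sum += input[r, x] * DctTables.CosTable[k, x];
                    rows[r, k] = Alpha(k, n) * sum;
                }

            for (int c = 0; c < n; c++)            // DCT along each column
                for (int k = 0; k < n; k++)
                {
                    double sum = 0;
                    for (int y = 0; y < n; y++)
                        sum += rows[y, c] * DctTables.CosTable[k, y];
                    output[k, c] = Alpha(k, n) * sum;
                }
            return output;
        }

        // Orthonormal DCT-II scaling factors.
        static double Alpha(int k, int n) => k == 0 ? Math.Sqrt(1.0 / n) : Math.Sqrt(2.0 / n);
    }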

3. Use a FastDCT

  • well, this is very tricky
  • the fast algorithm is a fusion of the (I)DST and the (I)DCT
  • there are a few papers about it
  • but they are vague (the equations differ between papers and are never complete)
  • I have actually never seen a working equation or program for it
  • the only almost functional approach is via the FFT
  • but for small N there is no gain, because of the switch to the complex domain
  • and the values are not really a DCT, only a close approximation of it
  • of course I am no expert in this field, so I may have overlooked something in all those hundreds of pages of equations
  • anyway, with a fast implementation of the 2D (I)DCT combined with bullet 2, the complexity is around O((N^2) * log(N))

4. Ditching the FPU multiplications

  • you can take all the weights and convert them to a1 = a0*1024
  • or any other scale factor
  • so:

    x*a0 = (x*a1)/1024 = (x*a1)>>10

  • the same can be done for the input data

  • now only integer operations remain
  • but on modern machines this approach can be slower than using the FPU (depends on platform and implementation); see the sketch below
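A small C# sketch of that fixed-point conversion; the scale of 1024 (a shift of 10) is just the example value above, and whether this actually beats the FPU has to be measured on your target machine:

    static class FixedPointWeights
    {
        public const int Shift = 10;               // scale factor 1024 = 1 << 10

        // Quantize the floating-point weights once: a1 = round(a0 * 1024).
        public static int[,] ToFixedPoint(double[,] weights)
        {
            int n = weights.GetLength(0);
            var q = new int[n, n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    q[i, j] = (int)Math.Round(weights[i, j] * (1 << Shift));
            return q;
        }

        // Usage inside the transform loop, integer-only:
        //   int approx = (x * fixedWeights[k, i]) >> Shift;   // ~ x * weights[k, i]
    }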

5. Ditching the integer multiplications

  • you can replace all multiplications with shift and add operations (look up binary multiplication); see the sketch below
  • but on modern machines this will actually slow things down
  • of course, if you are wiring this onto some logic board/IO then it has its merit
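And a tiny sketch of the shift-and-add idea; ShiftAddMultiply is a hypothetical helper, assumes a non-negative factor, and on a modern CPU it will normally be slower than a plain multiply:

    // Binary (shift-and-add) multiplication: accumulate shifted copies of x
    // for every set bit of the factor, so no multiply instruction is used.
    static int ShiftAddMultiply(int x, int factor)   // factor must be >= 0
    {
        int result = 0;
        for (int bit = 0; factor != 0; bit++, factor >>= 1)
            if ((factor & 1) != 0)
                result += x << bit;                  // add x * 2^bit
        return result;
    }
    // e.g. ShiftAddMultiply(x, 10) == (x << 3) + (x << 1)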
Spektre
  • Hey! thanks for the response. 1. yeah, I'm doing this, I have a precomputed 32x32 matrix. 2. Tried this, though I got weird results, probably a mistake in my code though. 3. uuuuurh. 4. Tried it, but it adds more overhead than just working with floats. 5. this is indeed (much) slower. I'm actually pretty happy with how it performs at the moment, so I didn't look further really. (I can create 1000 phashes in 447ms). Don't have numbers on the actual dct calculation, but it's negligible compared to the rest of the application. +1 for looking into this! – Remco Ros Jan 31 '14 at 10:52
  • glad to be of help. btw for bullet 2 you need to multiply the whole result by a normalization constant that depends on the input data size and the DCT type used. if you got something wrong, then it's probably only a wrong constant. I am using a normalized transformation (the magnitude does not change), so the constant for me is usually c=1/N – Spektre Jan 31 '14 at 16:09

My only understanding of applying matrices is related to manipulating 3D vectors, so I don't know the answer to your question directly. But in looking around, I did find this link to a blog where your specific issue is addressed. The comments at the bottom are from a bunch of people with knowledge in this area who could be a good pool of resources to chat with. Also, if you follow the links there is a lot of good image compression info.

The author appears to be heavily involved in photo forensics. He explains how pHash is more robust than the average hash and mentions using a 32 x 32 matrix.

This could be a really good starting point. Take care.

http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html

drankin2112
  • This blog post was actually my initial starting point when I began looking into perceptual image hashing. From there I read the C++ code of pHash and converted it to C#. The main issue I have, though, is that they use the DCT algorithm, which is (relatively) slow because of the multiplications needed. Because I need a very high-performance algorithm that is also very accurate, I started looking into near approximations of the DCT algorithm, which led me to BinDCT. Thanks though :) – Remco Ros Jan 14 '14 at 08:54
  • Were you able to make a 32x32 version of the DCT algorithm before looking into the binDCT version? Also, I've done extensive testing of bitwise computations in C# and it is a rare case indeed where a significant performance increase can be seen vs using regular arithmetic operators. It's because in C# we're not working with the memory directly but through IL. After compiler optimization, the IL is nearly always comparable in speed. My point is that the bitwise operations in the binDCT code you linked to matter less in C#, as does the fact that you're not doing floating-point arithmetic. – drankin2112 Jan 14 '14 at 16:47
  • I have a working implementation of phash in C# at the moment, yes. I compared it to P/Invoking a native C++ dll and it doesn't really gain anything. Using a profiler I see that ~80% of the time spent in the dct function comes from the for(for(for() multiply needed. That's a feature of the algorithm you can't avoid. BinDCT doesn't need multiplications, only bit shifts and additions. I already optimized the dct function to the point that it uses jagged arrays. Converting it to IL (or P/Invoking a native dll) would probably not even give me a 1% speed gain, so that's why I am looking for a complete – Remco Ros Jan 14 '14 at 16:59
  • -- completely different algorithm (ditching the multiplications) – Remco Ros Jan 14 '14 at 16:59
  • From what I read, BinDCT is a very good approximation of the DCT, but I just don't have the scientific experience to apply the articles/code I found to a 32x32 matrix. Some background info: I use this in an image recognition tool, which captures screenshots of a program and compares ROIs against a known list of hashes. The whole process of screen grabbing, then calculating its phash and comparing it to the known hash takes around 130 ms on my (i7) machine, with around 80% of the time spent in the DCT calculation. I was trying to optimize this so I can grab screenshots faster. – Remco Ros Jan 14 '14 at 17:03
  • I probably have to find a different solution, like grabbing the screenshots in multiple threads and letting multiple (dct) hash generations run at the same time. – Remco Ros Jan 14 '14 at 17:05
  • This is a very good question. +1 from me. I'll see if any of my 3D graphics matrix classes offer any insight. Simple matrix transformations and multiplication, regardless of application, can look pretty verbose. I got rid of the for(for(for() stuff and replaced it with index-based manipulations for speed, but we're only talking about 4x4 matrices. Can't imagine that with 32x32. Either way, good luck. I know there is a way to do what you're trying to do. – drankin2112 Jan 14 '14 at 17:12
  • I should probably post this on math exchange as well :) – Remco Ros Jan 14 '14 at 17:15
  • Ahh, we were typing at the same time. You can definitely get thread parallelism happening with the System.Threading.Tasks library. The Task class will isolate your tasks to each of your 4 processors. – drankin2112 Jan 14 '14 at 17:15
  • Yes.. also add the algorithm tag to your post. The brainiacs will have lots to offer! – drankin2112 Jan 14 '14 at 17:16
  • What I did try was spawning the outer loop of the multiplication over multiple threads (using Task). But that spawned 32 threads and resulted in a BIG performance loss :D – Remco Ros Jan 14 '14 at 17:17
  • I would try to figure out why Task is causing the performance loss. The i7 excels at multitasking vs single-core processes (like game loops). Maybe just use it on the inner process of all the for loops. It will create more threads but will have less of a problem with synchronizing.... Maybe :) – drankin2112 Jan 14 '14 at 17:23
  • If I can't find an alternative to DCT, I will probably just resort to making the actual screen-grabbing process multi-threaded (so higher up the API). I don't think it's worth investing THAT much time in optimizing a single thread's process, especially for this tool (hearthstonetracker.com) :D I'm pretty happy with 130 ms, though. But I wonder how it performs on lower-end machines. Hmm, premature optimization... I should focus on real features haha – Remco Ros Jan 14 '14 at 17:42