Distribution of bytes within jpeg files

Question

when observing compressed data, I expect an almost uniformely distributed byte stream. When using the chi square test for measure the distribution, I get this result e.g. for ZIP-files and other compressed data, but not for JPG-files. Last days I spent with finding reasons for this, but I cannot find any.

When calculating the entropy of JPGs, I get a high result (e.g. 7,95 Bits/Byte). I thought there must be a connection between the entropy and the distribution: the entropy is hight, when every byte appears with almost the same probability. But when using chi square, a get a p-value which is about 4,5e-5...

I just want to understand how different distributions influence the test results... I thought I can measure the same property with both tests, but obviously I can not.

Thank you very much for any hint! tom

Did you measure just the body or did you include the uncompressed header? — usr, Dec 04 '12 at 14:01
I tried to exclude metainformation... therefore I skipped the first and last 4096 bytes (1 cluster each). — tommynogger, Dec 04 '12 at 14:07
JPEG and many other formats have section headers and other metadata throughout the file, not just at the beginning and/or the end. If you really want to skip all metadata, you'll need to parse the header to figure out where other sections are so you can skip them as well... — twalberg, Dec 04 '12 at 15:47
i went through some files already but couldn't find anything which looks like metadata, everything looks uniform. — tommynogger, Dec 04 '12 at 19:04
What did you test with chi-square? The frequencies of the 256 byte values? — usr, Dec 04 '12 at 19:10

score 1 · Answer 1 · edited May 23 '17 at 12:31

Distribution in jpeg-files

Ignoring the meta-information and the jpeg-header-data, the payload of a jpeg consists of blocks describing huffmann-tables or encoded MCUs (Minimum-Coded-Units, square blocks of the size 16x16). There may be others but this are the most frequent ones.

Those blocks are delimited by 0xFF 0xSS, where 0xSS is a specific startcode. Here is the first problem: 0xFF is a bit more frequent as twalberg mentioned in the comments.

It may happen, that 0xFF occur in an encoded MCU. To distinguish between this normal payload and the start of a new block, 0xFF 0x00 is inserted. If the distribution of unstuffed payload is perfectly uniform, 0x00 will be twice as often in the stuffed data. To make bad things worse, every MCU is filled up with binary ones to get byte-alignment (a slight bias to larger values) and we might need stuffing again.

There may be also some other factors I'm not aware of. If you need more information you have to provide the jpeg-file.

And about your basic assumption:

for rand_data:

 dd if=/dev/urandom of=rand_data count=4096 bs=256

for rand_pseudo (python):

s = "".join(chr(i) for i in range(256))
with file("rand_pseudo", "wb") as f:
    for i in range(4096):
        f.write(s)

Both should be uniform regarding byte-values, shouldn't they? ;)

$ ll rand_*
-rw-r--r-- 1 apuch apuch 1048576 2012-12-04 20:11 rand_data
-rw-r--r-- 1 apuch apuch 1048967 2012-12-04 20:13 rand_data.tar.gz
-rw-r--r-- 1 apuch apuch 1048576 2012-12-04 20:14 rand_pseudo
-rw-r--r-- 1 apuch apuch    4538 2012-12-04 20:15 rand_pseudo.tar.gz

A uniform distribution might indicate a high entropy but its not a guarantee. Also, rand_data might consists out of 1MB of 0x00. Its extremely unlikely, but possible.

thank you very much! obviously i need to have a deeper understanding of the jpg-file format. anyway, i am still confused why the entropy is high, but the p-value from the chi-square calculation is very low (much lower than for ZIP/DOC/PDF)... — tommynogger, Dec 05 '12 at 11:17

score 0 · Answer 2 · answered Dec 05 '12 at 17:02

0

Here you can find two files: the first one is random data, generated with dev/unrandom (about 46MB), the second one is a normal JPG file (about 9MB). It is obvious that the symbols of the JPG-file are not as equally distributed as in dev/urandom.

If I compare both files:

Entropy: JPG: 7,969247 Bits/Byte RND: 7,999996 Bits/Byte

P-Value of chi-square test: JPG: 0 RND: 0,3621

How can the entropy lead to such a high result?!?

Random Data (dev/urandom JPG

answered Dec 05 '12 at 17:02

tommynogger

181
7

i'd like to push that post because i still haven't found an answer... happy new year everybody btw :) – tommynogger Jan 01 '13 at 13:49

score 0 · Answer 3 · answered Mar 09 '14 at 08:59

Here is my java code

         public static double getShannonEntropy_Image(BufferedImage actualImage){
         List<String> values= new ArrayList<String>();
           int n = 0;
           Map<Integer, Integer> occ = new HashMap<>();
           for(int i=0;i<actualImage.getHeight();i++){
             for(int j=0;j<actualImage.getWidth();j++){
               int pixel = actualImage.getRGB(j, i);
               int alpha = (pixel >> 24) & 0xff;
               int red = (pixel >> 16) & 0xff;
               int green = (pixel >> 8) & 0xff;
               int blue = (pixel) & 0xff;
//0.2989 * R + 0.5870 * G + 0.1140 * B greyscale conversion
//System.out.println("i="+i+" j="+j+" argb: " + alpha + ", " + red + ", " + green + ", " + blue);
                int d= (int)Math.round(0.2989 * red + 0.5870 * green + 0.1140 * blue);
               if(!values.contains(String.valueOf(d)))
                   values.add(String.valueOf(d));
               if (occ.containsKey(d)) {
                   occ.put(d, occ.get(d) + 1);
              } else {
                  occ.put(d, 1);
              }
              ++n;
       }
    }
    double e = 0.0;
    for (Map.Entry<Integer, Integer> entry : occ.entrySet()) {
         int cx = entry.getKey();
         double p = (double) entry.getValue() / n;
         e += p * log2(p);
    }
 return -e;
  }

Distribution of bytes within jpeg files

3 Answers3

Distribution in jpeg-files

And about your basic assumption: