
When observing compressed data, I expect an almost uniformly distributed byte stream. When using the chi-square test to measure the distribution, I do get this result, e.g. for ZIP files and other compressed data, but not for JPG files. I have spent the last few days searching for reasons for this, but I cannot find any.

When calculating the entropy of JPGs, I get a high result (e.g. 7.95 bits/byte). I thought there must be a connection between the entropy and the distribution: the entropy is high when every byte appears with almost the same probability. But when using chi-square, I get a p-value which is about 4.5e-5...

I just want to understand how different distributions influence the test results... I thought I could measure the same property with both tests, but obviously I cannot.
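A minimal sketch of how both values can be computed on the same byte stream (assuming Python 3 with SciPy installed; the file name is just a placeholder):

    import math
    from collections import Counter
    from scipy.stats import chisquare

    data = open("sample.jpg", "rb").read()
    n = len(data)
    counts = Counter(data)

    # Shannon entropy of the byte-value histogram, in bits/byte
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())

    # Chi-square test of the 256 byte frequencies against a uniform
    # distribution (chisquare assumes equal expected frequencies by default)
    observed = [counts.get(b, 0) for b in range(256)]
    stat, p = chisquare(observed)

    print("entropy:", entropy, "bits/byte, p-value:", p)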

Thank you very much for any hint! tom

tommynogger
  • Did you measure just the body or did you include the uncompressed header? – usr Dec 04 '12 at 14:01
  • I tried to exclude meta-information... therefore I skipped the first and last 4096 bytes (1 cluster each). – tommynogger Dec 04 '12 at 14:07
  • JPEG and many other formats have section headers and other metadata throughout the file, not just at the beginning and/or the end. If you really want to skip all metadata, you'll need to parse the header to figure out where other sections are so you can skip them as well... – twalberg Dec 04 '12 at 15:47
  • I went through some files already but couldn't find anything that looks like metadata; everything looks uniform. – tommynogger Dec 04 '12 at 19:04
  • What did you test with chi-square? The frequencies of the 256 byte values? – usr Dec 04 '12 at 19:10
  • I observed bytes, i.e. you have 256 different values. – tommynogger Dec 05 '12 at 11:16

3 Answers


Distribution in JPEG files

Ignoring the meta-information and the JPEG header data, the payload of a JPEG consists of blocks describing Huffman tables or encoded MCUs (Minimum Coded Units, square blocks of size 16×16). There may be others, but these are the most frequent ones.

Those blocks are delimited by 0xFF 0xSS, where 0xSS is a specific start code. Here is the first problem: 0xFF is a bit more frequent than expected, as twalberg mentioned in the comments.

It may happen that 0xFF occurs in an encoded MCU. To distinguish this normal payload from the start of a new block, the sequence 0xFF 0x00 is inserted (byte stuffing). If the distribution of the unstuffed payload were perfectly uniform, 0x00 would appear twice as often in the stuffed data. To make bad things worse, every MCU is padded with binary ones to achieve byte alignment (a slight bias towards larger values), and we might need stuffing again.
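To illustrate, here is a small sketch that measures the 0xFF frequency and undoes the byte stuffing (a simplification that treats the whole file as entropy-coded data instead of walking the real segment structure; the file name is a placeholder):

    data = open("sample.jpg", "rb").read()

    # Marker bytes make 0xFF noticeably more frequent than 1/256
    print("0xFF frequency:", data.count(0xFF) / len(data))
    print("uniform expectation:", 1 / 256)

    # Inside the entropy-coded data a literal 0xFF is always followed
    # by a stuffed 0x00, so 0xFF 0x00 collapses back to a single 0xFF
    unstuffed = data.replace(b"\xff\x00", b"\xff")
    print("stuffed bytes removed:", len(data) - len(unstuffed))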

There may also be other factors I'm not aware of. If you need more information, you would have to provide the JPEG file.

And about your basic assumption:

For rand_data:

 dd if=/dev/urandom of=rand_data count=4096 bs=256

For rand_pseudo (Python 3):

s = "".join(chr(i) for i in range(256))
with file("rand_pseudo", "wb") as f:
    for i in range(4096):
        f.write(s)

Both should be uniform regarding byte values, shouldn't they? ;)

$ ll rand_*
-rw-r--r-- 1 apuch apuch 1048576 2012-12-04 20:11 rand_data
-rw-r--r-- 1 apuch apuch 1048967 2012-12-04 20:13 rand_data.tar.gz
-rw-r--r-- 1 apuch apuch 1048576 2012-12-04 20:14 rand_pseudo
-rw-r--r-- 1 apuch apuch    4538 2012-12-04 20:15 rand_pseudo.tar.gz

A uniform distribution might indicate a high entropy, but it's not a guarantee. Also, rand_data might consist of 1 MB of 0x00. It's extremely unlikely, but possible.
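A quick check of that point (a sketch, assuming the two files from above are in the current directory): rand_pseudo reaches exactly 8 bits/byte of byte-value entropy even though it is completely predictable, as its tiny .tar.gz already shows.

    import math
    from collections import Counter

    for name in ("rand_data", "rand_pseudo"):
        data = open(name, "rb").read()
        n = len(data)
        # Shannon entropy of the byte-value histogram
        h = -sum(c / n * math.log2(c / n) for c in Counter(data).values())
        print(name, round(h, 6), "bits/byte")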

Peter Schneider
  • Thank you very much! Obviously I need a deeper understanding of the JPG file format. Anyway, I am still confused why the entropy is high but the p-value from the chi-square calculation is very low (much lower than for ZIP/DOC/PDF)... – tommynogger Dec 05 '12 at 11:17

Here you can find two files: the first one is random data, generated with /dev/urandom (about 46 MB), the second one is a normal JPG file (about 9 MB). It is obvious that the symbols of the JPG file are not as equally distributed as in /dev/urandom.

If I compare both files:

Entropy: JPG: 7.969247 bits/byte, RND: 7.999996 bits/byte

P-value of chi-square test: JPG: 0, RND: 0.3621

How can the entropy still be so high?!?

[Images: byte histograms of the random data (/dev/urandom) and of the JPG]

tommynogger

Here is my Java code, which computes the Shannon entropy over the greyscale values of the pixels:

    import java.awt.image.BufferedImage;
    import java.util.HashMap;
    import java.util.Map;

    public static double getShannonEntropy_Image(BufferedImage actualImage) {
        int n = 0;
        Map<Integer, Integer> occ = new HashMap<>();
        for (int i = 0; i < actualImage.getHeight(); i++) {
            for (int j = 0; j < actualImage.getWidth(); j++) {
                int pixel = actualImage.getRGB(j, i);
                int red   = (pixel >> 16) & 0xff;
                int green = (pixel >> 8) & 0xff;
                int blue  = pixel & 0xff;
                // 0.2989 * R + 0.5870 * G + 0.1140 * B greyscale conversion
                int d = (int) Math.round(0.2989 * red + 0.5870 * green + 0.1140 * blue);
                occ.merge(d, 1, Integer::sum); // count occurrences of each grey value
                ++n;
            }
        }
        double e = 0.0;
        for (Map.Entry<Integer, Integer> entry : occ.entrySet()) {
            double p = (double) entry.getValue() / n;
            e += p * log2(p);
        }
        return -e;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }
Jithu R Jacob