Distribution in JPEG files
Ignoring the meta-information and the JPEG header data, the payload of a JPEG consists of blocks describing Huffman tables or encoded MCUs (Minimum Coded Units, pixel blocks that are typically 8x8 up to 16x16, depending on the chroma subsampling). There may be other block types, but these are the most frequent ones.
Those blocks are delimited by 0xFF 0xSS, where 0xSS is a specific startcode. Here is the first problem: 0xFF is a bit more frequent, as twalberg mentioned in the comments. It may happen that 0xFF occurs in an encoded MCU. To distinguish this normal payload from the start of a new block, a 0x00 is stuffed after it, so 0xFF 0x00 appears in the stream. If the distribution of the unstuffed payload is perfectly uniform, 0x00 will occur twice as often in the stuffed data. To make bad things worse, the entropy-coded data is padded with binary ones to reach byte alignment (a slight bias towards larger values), and we might need stuffing again.
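
To make the effect visible, here is a minimal sketch (Python; photo.jpg is a placeholder name) that counts how often 0xFF is followed by a stuffed 0x00 versus by some other byte. It treats every non-zero follower as a startcode, which slightly overcounts markers (fill bytes 0xFF 0xFF land in the same bucket), but it is enough to see the skew:

with open("photo.jpg", "rb") as f:  # placeholder file name
    data = f.read()

stuffed = markers = 0
for i in range(len(data) - 1):
    if data[i] == 0xFF:
        if data[i + 1] == 0x00:
            stuffed += 1   # 0xFF belonging to the entropy-coded payload
        else:
            markers += 1   # 0xFF introducing a marker/startcode

print("stuffed 0xFF 0x00:", stuffed)
print("marker  0xFF 0xSS:", markers)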
There may also be other factors I'm not aware of. If you need more information, you will have to provide the JPEG file.
And about your basic assumption:
for rand_data:
dd if=/dev/urandom of=rand_data count=4096 bs=256
for rand_pseudo (python):
s = "".join(chr(i) for i in range(256))
with file("rand_pseudo", "wb") as f:
for i in range(4096):
f.write(s)
Both should be uniform regarding byte values, shouldn't they? ;)
$ ll rand_*
-rw-r--r-- 1 apuch apuch 1048576 2012-12-04 20:11 rand_data
-rw-r--r-- 1 apuch apuch 1048967 2012-12-04 20:13 rand_data.tar.gz
-rw-r--r-- 1 apuch apuch 1048576 2012-12-04 20:14 rand_pseudo
-rw-r--r-- 1 apuch apuch 4538 2012-12-04 20:15 rand_pseudo.tar.gz
A uniform distribution might indicate high entropy, but it's not a guarantee. Also, rand_data might consist of 1 MB of 0x00 bytes. It's extremely unlikely, but possible.
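
To make that last point measurable, here is a minimal sketch (assuming the two files above exist in the working directory) that computes the empirical single-byte entropy. Both files come out at roughly 8 bits per byte, even though the .tar.gz sizes show that rand_pseudo is almost perfectly compressible:

import math
from collections import Counter

def byte_entropy(path):
    # Empirical entropy of the byte-value distribution, in bits per byte.
    with open(path, "rb") as f:
        counts = Counter(f.read())
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

print(byte_entropy("rand_data"))    # close to 8.0
print(byte_entropy("rand_pseudo"))  # exactly 8.0

Single-byte statistics simply cannot distinguish the two files; only a model that looks at longer contexts, like the one in gzip, can.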