For example, how can RAR tell that a 4 GB text file can be compressed to, say, 200 MB? Obviously, it doesn't read the whole file in 2 or so seconds, so what kind of predictive algorithm(s) does it use?
I am only guessing, but it probably samples the file: it compresses, say, 1% of the contents and extrapolates from that. Of course, the samples must be scattered all over the file. – Tomasz Nurkiewicz Mar 27 '11 at 16:33
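The sampling idea in this comment can be sketched as follows. This is only an illustration of the guess, not RAR's actual method (which, as noted further down, nobody here knows). The function name `estimate_ratio`, the sample count and size, and the use of `zlib` as a stand-in compressor are all assumptions made for the example.

```python
import os
import zlib


def estimate_ratio(path, sample_count=16, sample_size=64 * 1024):
    """Guess compressed/original size ratio from evenly spaced samples.

    Hypothetical helper: zlib stands in for the real compressor, so the
    result is only a rough indication of how compressible the data is.
    """
    file_size = os.path.getsize(path)
    if file_size == 0:
        return 1.0

    # Small files: just compress everything, no sampling needed.
    if file_size <= sample_count * sample_size:
        with open(path, "rb") as f:
            data = f.read()
        return len(zlib.compress(data)) / len(data)

    raw = 0
    packed = 0
    last_start = file_size - sample_size
    with open(path, "rb") as f:
        for i in range(sample_count):
            # Spread the sample offsets from the start to the end of the file.
            f.seek(i * last_start // max(sample_count - 1, 1))
            chunk = f.read(sample_size)
            raw += len(chunk)
            packed += len(zlib.compress(chunk))
    return packed / raw
```

Multiplying the returned ratio by the file size gives a size estimate. Compressing each sample independently tends to be pessimistic, because a real compressor can reuse matches found earlier in the file.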
2 Answers
-
Rar does use ppmd, but that isn't related in any way to compression ratio estimation. "Partial matching" in PPM is about predicting the next symbol from a short prefix string (not a full match). PPM is more computationally heavy than LZ, so it's unlikely that it's used for any fast estimation. Anyway, nobody knows how rar does it, but Tomasz is likely right. – Shelwien Mar 27 '11 at 18:40
-
@Shelwien: I'm a great fan of your site! I love data compression, but my understanding is limited (and I'm a lazy person). What do you think of my answer? Is it worth an upvote? Thank you! – Micromega Mar 27 '11 at 19:06
-
Presuming that x=1024 ("compress x bits") and log=log2, (-log(x)+log(2))=-10+1=-9. What's that supposed to mean? And how is it related to rar's compression ratio estimation? – Shelwien Mar 27 '11 at 20:47
-
I mean: with char a=3 and char b=4, -log(3)+log(2)+log(4)+log(2) = how many bits you need to encode the string aaaabbbb. – Micromega Mar 27 '11 at 21:35
-
No, that's not how it works. By Shannon's entropy, a symbol with probability p costs -log2(p) = -log(p)/log(2) bits. In your case it's -4*log2(a/(a+b))-4*log2(b/(a+b)) = (8*log(7)-4*log(3)-4*log(4))/log(2) – Shelwien Mar 28 '11 at 01:13
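Evaluating that expression numerically, with a=3 and b=4 read as symbol counts (so the probabilities are 3/7 and 4/7) and four occurrences of each symbol, as in the comment:

```latex
\begin{aligned}
-4\log_2\frac{3}{7} - 4\log_2\frac{4}{7}
  &= 8\log_2 7 - 4\log_2 3 - 4\log_2 4 \\
  &\approx 22.46 - 6.34 - 8.00 \\
  &\approx 8.12 \text{ bits}
\end{aligned}
```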
-
Thanks for the fast reply, but I don't understand it. I think I have a problem with the frequency of occurrence. Anyway, here is a good link: http://www.bearcave.com/misl/misl_tech/wavelets/compression/shannon.html – Micromega Mar 28 '11 at 11:37
It usually takes -log(x) + log(2) bits to compress x bits. However, this is a highly theoretical value, and it depends heavily on the data you want to compress. For your data, you have to record each character and its frequency and insert them into the formula. For example, try it with only 3 characters first. You want to look up Shannon coding.
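The step of recording each character and its frequency can be sketched like this. The sketch uses the standard per-symbol cost of -log2(p) bits from Shelwien's comment rather than the formula above, and the name `order0_bits` is made up for the example. It models each symbol independently of its neighbours, so a real compressor that exploits context may do considerably better.

```python
import math
from collections import Counter


def order0_bits(data):
    """Ideal code length in bits under a frequency-only (order-0) model.

    Hypothetical helper: each symbol with relative frequency p is charged
    -log2(p) bits, and the charges are summed over the whole input.
    """
    counts = Counter(data)
    total = len(data)
    return sum(-n * math.log2(n / total) for n in counts.values())


# Four 'a's and four 'b's: each symbol has p = 1/2 and costs exactly 1 bit.
print(order0_bits("aaaabbbb"))  # 8.0
```

Dividing the result by 8 gives a size estimate in bytes for that model.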

Micromega