We have a huge and growing number of Windows images and other large files, and I assume much of the data is duplicated across them.
Is there a compression system, or maybe even a filesystem, which detects this?
Yes, you need a technique called 'deduplication'. Unlike compression, which looks at individual files, it looks for block-level repetition: if you had a million copies of the same file, it should only store one real copy and then refer to it with a million pointers. Let us know which OS you're looking at and I'll try to find a program that does it 'in server'; it's very often a function of a NAS/SAN system such as NetApp filers.
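If you want a rough sense of how much block-level duplication you actually have before buying anything, you can estimate it with a quick script. This is only a sketch of the detection idea, not how real dedup systems work internally; the 4 KiB block size and the directory argument are assumptions:

    # Estimate block-level dedup savings by hashing fixed-size chunks.
    # Real dedup systems (NetApp, ZFS, etc.) do this inside the
    # filesystem/storage layer; this just measures the potential win.
    import hashlib
    import os
    import sys

    BLOCK_SIZE = 4096  # assumed block size; real systems vary

    def block_hashes(path):
        """Yield a SHA-256 digest for each fixed-size block of a file."""
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                yield hashlib.sha256(block).digest()

    def estimate_savings(root):
        total = 0
        unique = set()
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                for digest in block_hashes(os.path.join(dirpath, name)):
                    total += 1
                    unique.add(digest)
        if total:
            print(f"{total} blocks scanned, {len(unique)} unique "
                  f"({100 * (1 - len(unique) / total):.1f}% duplicated)")

    if __name__ == "__main__":
        estimate_savings(sys.argv[1])

Run it against a directory of your images; if the duplicated percentage is high, block-level dedup will pay off.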
You can try DZO (still in beta) from http://essensolabs.com/, which combines deduplication and lossless compression.
Do you mean "images" as in ISO/WIM files, or just directory structures containing different Windows install images?
If it's the latter, Windows Storage Server 2003 R2 uses Single Instance Storage (SIS), which is a fancy way of saying it detects multiple copies of identical files, stores one copy in the SIS Common Store, and places links where the files are meant to be. This happens low down in the filesystem and is transparent to applications.
The problem is that you cannot simply purchase licences for the WSS product; it is only available through OEM partners, so you'd have to buy a new box from HP, Dell, etc. that runs WSS.
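For what it's worth, the idea behind SIS is simple enough to sketch at user level. This isn't SIS itself (which lives down in the filesystem driver); it's just an illustration of file-level dedup via links, assuming everything sits on one volume so hard links are possible:

    # Illustration of the SIS idea: hash whole files and replace
    # byte-identical copies with hard links to the first copy seen.
    # SIS proper does this transparently inside the filesystem.
    import hashlib
    import os
    import sys

    def file_hash(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.digest()

    def dedupe(root):
        seen = {}  # content digest -> first path with that content
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                digest = file_hash(path)
                original = seen.get(digest)
                if original is None:
                    seen[digest] = path
                else:
                    # Replace the duplicate with a hard link, much as
                    # SIS replaces copies with links to its common store.
                    os.remove(path)
                    os.link(original, path)

    if __name__ == "__main__":
        dedupe(sys.argv[1])

Note the difference from the block-level approach above: this only saves space when entire files are identical, which is exactly the case SIS targets.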