7

What is the optimal algorithm for filling a set of Blu-ray discs, given many hundreds of gigabytes of assets of varying sizes?

I am trying to consolidate a large number of old CD-ROMs, DVDs, and small hard drives and put everything in a database indexed by MD5 signature. A daunting task, for sure.

What I currently do is sort the asset sizes (usually directory sizes) in descending order, then insert the largest assets into the fill list, skipping any that don't fit, until I run out of assets. It runs almost instantaneously, but I would not mind running it overnight once if necessary.

It usually gives me 95% or more utilization, but I am sure there is a way to use other combinations to achieve higher efficiency. With huge items like disk images, I can get quite low utilization with this primitive method.
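A minimal sketch of that greedy pass in Perl (the asset names and sizes below are placeholders for the real directory totals):

    use strict;
    use warnings;

    my $capacity = 25_025_314_816;    # BD-R capacity in bytes
    # @assets: hypothetical [ name, size ] pairs, e.g. from a directory scan
    my @assets = (
        [ 'dir_a', 9_000_000_000 ],
        [ 'dir_b', 8_500_000_000 ],
        [ 'dir_c', 8_000_000_000 ],
    );

    my ( @fill, $used ) = ();
    $used = 0;
    for my $asset ( sort { $b->[1] <=> $a->[1] } @assets ) {   # largest first
        next if $used + $asset->[1] > $capacity;               # skip what won't fit
        push @fill, $asset->[0];
        $used += $asset->[1];
    }
    printf "filled %d bytes (%.1f%% of disc)\n", $used, 100 * $used / $capacity;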

My thought is to try all combinations of the assets taken 1, then 2, then 3, ... items at a time, keeping a running record of the highest byte count below 25,025,314,816 bytes (the disc capacity) and a pointer to the combination that sums to it. When I reach the point where I am taking so many assets at a time that no combination fits, stop and use the combination recorded by the running-best counter.

Is this the best possible algorithm?

There are two Perl modules which seem up to the task, Algorithm::Combinatorics and Math::Combinatorics. Any advice on which is faster, more stable, cooler?

My scheme is to write a script to calculate the sizes of a large number of directories (a rough sketch of that step is below) and show me the optimal contents of the dozens of discs to burn.

And I don't want to fill on a file-by-file basis; I want entire directories kept together on the same disc.
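For the directory-sizing step, a rough sketch using the core File::Find module (the root path and the one-level grouping rule are assumptions, not the final script):

    use strict;
    use warnings;
    use File::Find;

    # Sum file sizes per top-level directory under a root (path is hypothetical).
    my $root = '/path/to/staging';
    my %dir_size;
    find( sub {
        return unless -f $_;                        # regular files only
        # attribute the size to the first directory level under $root
        my ($top) = $File::Find::name =~ m{^\Q$root\E/([^/]+)/};
        $dir_size{$top} += -s _ if defined $top;    # reuse the stat from -f
    }, $root );

    printf "%-40s %15d\n", $_, $dir_size{$_}
        for sort { $dir_size{$b} <=> $dir_size{$a} } keys %dir_size;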

user1556337

4 Answers

5

This is an NP-complete problem known as bin packing. There is no known polynomial-time algorithm that solves it optimally; in other words, you cannot be sure of finding the optimal solution without essentially trying all of them.

On the plus side, a very simple heuristic like "put the largest remaining folder on the first disk that has room" will guarantee that you will use fewer than twice as many disks as the best case. (You can read more details on the problem's Wikipedia article).
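A minimal Perl sketch of that heuristic, usually called first-fit decreasing, with placeholder sizes; it opens a new disc only when an item fits nowhere else:

    use strict;
    use warnings;

    my $capacity = 25_025_314_816;                        # bytes per disc
    my @sizes = sort { $b <=> $a } ( 9e9, 8e9, 7e9, 6e9, 5e9 );   # placeholders

    my @discs;    # each element: { used => bytes, items => [ sizes ] }
    ITEM: for my $size (@sizes) {
        for my $disc (@discs) {                           # first disc with room wins
            if ( $disc->{used} + $size <= $capacity ) {
                push @{ $disc->{items} }, $size;
                $disc->{used} += $size;
                next ITEM;
            }
        }
        push @discs, { used => $size, items => [$size] }; # open a new disc
    }
    printf "disc %d: %d bytes used\n", $_ + 1, $discs[$_]{used} for 0 .. $#discs;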

chepner
2

This problem is called 1-D bin packing. Greedy heuristics for it are very fast but not optimal. You could also use a brute-force algorithm, but the search space is very big. Here is a program with a greedy algorithm: http://www.phpclasses.org/package/2027-PHP-Pack-files-without-exceeding-a-given-size-limit.html

Micromega
0

This is the most practical method I have yet found to efficiently fill my Blu-ray discs.

I make a list of fully qualified paths to all of the available files to burn.

Then I (arbitrarily) decide how many directory levels to treat as a unit, or accept a command-line option for it. This keeps directories full of like items together on a single Blu-ray. There is also a STUFF option: insert the largest files first, and when a file would cause an overflow, try the next smaller one, until you run out of files or space.

Make a hash with each directory as the key and the total size of the files it contains as the data. Also keep a parallel hash with the count of files per directory, since slack space and directory overhead add up and have to be accounted for.

Pick 22 as the magic number. If you have 22 or fewer directories, try all combinations to find the one closest to, but not over, 25.025 GB. If you have more than 22, just use the 22 largest. I use the Perl module Algorithm::Combinatorics to find all of the combinations. Through trial and mostly error, I determined that combinations of 21 items take just a few seconds, 23 items take many minutes (longer than my attention span), and 22 takes about 35 seconds.
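A minimal sketch of that combination search, assuming a %dir_size hash of directory totals as described above (the names and sizes here are placeholders); it uses the real combinations iterator from Algorithm::Combinatorics:

    use strict;
    use warnings;
    use Algorithm::Combinatorics qw(combinations);

    my $capacity = 25_025_314_816;
    my $magic    = 22;               # largest candidate pool that finishes quickly

    # %dir_size: hypothetical directory => total bytes, built earlier
    my %dir_size = ( alpha => 9e9, beta => 8e9, gamma => 7e9, delta => 2e9 );
    my @pool = ( sort { $dir_size{$b} <=> $dir_size{$a} } keys %dir_size )
                   [ 0 .. $magic - 1 ];
    @pool = grep { defined } @pool;  # fewer than 22 directories is fine

    my ( $best_sum, @best ) = (0);
    for my $k ( 1 .. @pool ) {
        my $iter = combinations( \@pool, $k );
        while ( my $combo = $iter->next ) {
            my $sum = 0;
            $sum += $dir_size{$_} for @$combo;
            if ( $sum <= $capacity && $sum > $best_sum ) {
                ( $best_sum, @best ) = ( $sum, @$combo );
            }
        }
    }
    printf "best fill: %d bytes\n%s\n", $best_sum, join "\n", @best;

The early exit described above (stopping once no combination of k items fits) is omitted here for brevity.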

An output directory is also accepted and checked for existing data. There is an option to move the files (copy, check the size, and unlink).

Every time I bought a new hard drive, it was usually twice as large as the previous one, so I would just copy everything over. With a Nikon D800E (Extreme!), HDR, and panoramas, I finally ran out of space.

My project was to de-duplicate, weed, and consolidate 15 years' worth of [mostly junk] photos, videos, movies, music, etc. I inventoried roughly a dozen storage devices, calculated MD5 signatures, and put them all in a database. I picked one drive as the master for pictures and one for video and nuked everything else. I found 8 copies of some stuff!

I now have about 10 TB of free disk space!!!

I would have posted the function which does all of the real work, in case anybody is interested, but the submission form mangled my pristine code. Sorry :(.

user1556337
-2

Use the algorithm from the "Knapsack" optimization problem.

http://en.wikipedia.org/wiki/Knapsack_problem

  1. Set the weight equal to the file size
  2. Set the value equal to the weight
  3. Run the algorithm for each subsequent disc to be packed

It may not be the best choice (it maximizes the fill factor of the next disc instead of minimizing the total number of discs needed), but it is well documented, and it is easy to find examples and working code for the programming language of your choice (even spreadsheets) on the web.
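For illustration, a hedged Perl sketch of that 0/1 knapsack with value equal to weight; the sizes are placeholders, scaled to whole megabytes because a DP table over a 25 GB capacity in raw bytes would be far too large:

    use strict;
    use warnings;

    # 0/1 knapsack, value == weight, 1-D dynamic program over capacity in MB.
    my $cap_mb = 25_025;                      # ~25.025 GB expressed in MB
    my @w = ( 9000, 8500, 7000, 4000, 600 );  # placeholder file sizes in MB

    my @best = (0) x ( $cap_mb + 1 );         # best fill achievable at each capacity
    for my $wt (@w) {
        for ( my $c = $cap_mb; $c >= $wt; $c-- ) {   # descending: each item used once
            my $cand = $best[ $c - $wt ] + $wt;
            $best[$c] = $cand if $cand > $best[$c];
        }
    }
    print "best fill for this disc: $best[$cap_mb] MB\n";

Recovering which files achieve that optimum needs an extra choice table, omitted here.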

anttix
  • No. Knapsack has 2 variables. – Micromega Jul 27 '12 at 01:19
  • So what? You can set all elements to have a "value" of 1 for example. – anttix Jul 27 '12 at 01:23
  • Sure, you can do this. But does it work for a metric of bytes and kilobytes? It's something virtual. – Micromega Jul 27 '12 at 01:25
  • I don't quite follow. What difference does it make what units we use to denote "weights" when solving the Knapsack? – anttix Jul 27 '12 at 01:30
  • For example, in Euclidean space there is the triangle inequality. – Micromega Jul 27 '12 at 01:33
  • The problem the OP has is 1-D, like you suggested. How does the triangle apply? – anttix Jul 27 '12 at 01:34
  • Is it better suited than weights? – Micromega Jul 27 '12 at 01:38
  • I am not sure I follow the triangle idea. Since bin packing is known to be NP-hard, there is no easy algorithm to find the optimal solution. Solving knapsack for every subsequent disc to be burned is definitely a better approximation than the simple greedy algorithm the OP has now. It may not be the best for the task, but implementations are widely available and easy to understand and use. – anttix Jul 27 '12 at 01:43
  • If weights and values are equal, then knapsack just reduces to bin packing. – chepner Jul 27 '12 at 15:40