
I have a List<FileInfo> of files

List<FileInfo> files = GetFiles();

whose files total about 2 GB. Now I need to chunk these files into 500 MB parts. In this case the result would be about four List<FileInfo>, each with a total file size below 500 MB. I have no idea how to apply Sum() to this..

List<List<FileInfo>> result = files.GroupBy(x => x.Length / 1024 / 1024 < 500)
                                   .Select(x => x.ToList()).ToList();
Impostor
  • What are the maximum and minimum sizes of the files? – FarhadGh Dec 03 '19 at 08:16
  • Maximum 500MB, so as many files as possible should be used until 500MB is reached. The minimum is not defined - so 1 file – Impostor Dec 03 '19 at 08:20
  • `FileInfo` contains information about the file, not the actual file. Please add more details. For example, if your list of file information contains 7 files that are all larger than 500MB, what do you expect the result to be? – SᴇM Dec 03 '19 at 08:22
  • 1
    You cannot accomplish this with linq or group by. What you can do is set a minimum and maximum and add to sublists while you get into range of min and max (for example, 450 to 500 mb). and in next iterate, add the child list to main list of lists, and go to next child list. If you need sample code, let me know. – FarhadGh Dec 03 '19 at 08:26
  • @SᴇM the files are up to 100KB – Impostor Dec 03 '19 at 08:26
  • 1
    I'd suggest you first come up with a non-linq solution that works for you, then take a look if it can be "linqified". Divide and Conquer. Also, Linq is just a tool. If it doesn't work use the tool that does. – Fildor Dec 03 '19 at 08:31
  • 1
    Some questions: Do you want the chunks to be evenly distributed size-wise? Or could you live with , let's say 1- 500MB 2- 480MB 3 - 450MB 4 - 470MB 5 - 100MB ? – Fildor Dec 03 '19 at 08:35
  • 1
    @Fildor the chunk should take as much files as possible - as long it's below 500MB – Impostor Dec 03 '19 at 08:37
  • 1
    OK. Because that's kind of important to the algorithm. You can put much effort in coming close to the 500MB threshold or less. Less will probably make the algorithm less complex and maybe faster but may result in more size-jitter. More effort will likely be slower and more complex but give you maxed out chunks. Maybe it's worthwhile to try both and have them run against each other to see which one is a best-fit. – Fildor Dec 03 '19 at 08:41
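As a rough illustration of the approach FarhadGh describes above, here is a minimal sketch that only enforces the upper bound (the ChunkBySize name is made up for this example; it is not from the thread):

// Simplified sketch: fill each sublist until adding the next file would
// exceed maxBytes, then start a new sublist.
static List<List<FileInfo>> ChunkBySize(IEnumerable<FileInfo> files, long maxBytes)
{
    var chunks = new List<List<FileInfo>>();
    var current = new List<FileInfo>();
    long currentSize = 0;
    foreach (var file in files)
    {
        if (current.Count > 0 && currentSize + file.Length > maxBytes)
        {
            chunks.Add(current);            // close the full chunk
            current = new List<FileInfo>();
            currentSize = 0;
        }
        current.Add(file);
        currentSize += file.Length;
    }
    if (current.Count > 0) chunks.Add(current); // last, possibly partial chunk
    return chunks;
}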

2 Answers


Here is something that works.

List<FileInfo> files = GetFiles();
List<List<FileInfo>> listOfLists = new List<List<FileInfo>>();
files.ForEach(x => {
     // Put the file into the first chunk that still has room for it...
     var match = listOfLists.FirstOrDefault(lf => lf.Sum(f => f.Length) + x.Length < 500 * 1024 * 1024);
     if (match != null)
         match.Add(x);
     else
         // ...otherwise start a new chunk with this file.
         listOfLists.Add(new List<FileInfo>() { x });
});
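
Note that this fills chunks first-fit: each file goes into the first chunk that still has room, so a chunk may receive files from anywhere in the input order. A quick way to inspect the result (using the listOfLists from above):

// Print how many files and how many bytes ended up in each chunk.
foreach (var chunk in listOfLists)
    Console.WriteLine($"{chunk.Count} files, {chunk.Sum(f => f.Length)} bytes");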
FarhadGh

Here is a generic BatchBySize extension method that you could use:

/// <summary>
/// Batches the source sequence into sized buckets.
/// </summary>
public static IEnumerable<TSource[]> BatchBySize<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, long> sizeSelector,
    long maxSize)
{
    var buffer = new List<TSource>();
    long sumSize = 0;
    foreach (var item in source)
    {
        long itemSize = sizeSelector(item);
        if (buffer.Count > 0 && checked(sumSize + itemSize) > maxSize)
        {
            // Emit full batch before adding the new item
            yield return buffer.ToArray(); buffer.Clear(); sumSize = 0;
        }
        buffer.Add(item); sumSize += itemSize;
        if (sumSize >= maxSize)
        {
            // Emit full batch after adding the new item
            yield return buffer.ToArray(); buffer.Clear(); sumSize = 0;
        }
    }
    if (buffer.Count > 0) yield return buffer.ToArray();
}

Usage example:

List<FileInfo[]> result = files
    .BatchBySize(x => x.Length, 500_000_000)
    .ToList();
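
One detail worth noting: the example passes 500_000_000 bytes, i.e. 500 decimal megabytes. If the 500MB limit is meant as 500 MiB, the binary value could be passed instead:

List<FileInfo[]> result = files
    .BatchBySize(x => x.Length, 500L * 1024 * 1024) // 524,288,000 bytes
    .ToList();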
Theodor Zoulias