
I have a text file with hundreds of thousands of image urls. I want to download all these images and put them in one zip file.

I can get each image and add it to the zip directly one by one without saving it to disk:

using StreamReader reader = new StreamReader("UrlList.txt");
string? _url;
while ((_url = reader.ReadLine()) != null)
{
    using var responseStream = await _httpClient.GetStreamAsync(_url);
    using var zipStream = new FileStream("images.zip", FileMode.OpenOrCreate);
    using var zip = new ZipArchive(zipStream, ZipArchiveMode.Update);
    var fileName = Path.GetFileName(new Uri(_url).LocalPath); // entry name taken from the URL
    var file = zip.CreateEntry(fileName);
    using var fileStream = file.Open();
    await responseStream.CopyToAsync(fileStream);
}

but this looks inefficient (~4.5 hours for 40k images in my case), and I am not sure it doesn't leak memory. Also, ZipArchive does not allow adding files in parallel.
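One thing I noticed: the loop above reopens the zip file and rewrites the archive on every iteration. Even a still-sequential variation that keeps the archive open once avoids that cost (a sketch I have not benchmarked; it assumes each URL ends in a usable file name):

```csharp
using System.IO.Compression;

// Sketch: open the archive once, still downloading sequentially.
// ZipArchiveMode.Create writes forward-only, so nothing is rewritten per entry.
using var zipStream = new FileStream("images.zip", FileMode.Create);
using var zip = new ZipArchive(zipStream, ZipArchiveMode.Create);

foreach (string url in File.ReadLines("UrlList.txt"))
{
    using var responseStream = await _httpClient.GetStreamAsync(url);
    var entry = zip.CreateEntry(Path.GetFileName(new Uri(url).LocalPath));
    using var entryStream = entry.Open();
    await responseStream.CopyToAsync(entryStream);
}
```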

Is there any way to do this efficiently?

  • You think it's inefficient only because it takes too long. I think you haven't analyzed where the bottleneck is: downloading, compressing, or writing? Since the order in which each file is processed is fixed, you need to know which step takes the longest. – shingo Apr 01 '23 at 06:43
  • Fastest way is to use FTP to get all files at one time using an asterisk and put them into a temp folder. Then zip the temp folder. ZIP utilities are designed to add multiple files from a disk. Adding one file at a time is less efficient. – jdweng Apr 01 '23 at 06:56
  • @jdweng I tried to avoid the intermediate step of saving the files to disk to avoid wasting memory, but perhaps zipping the folder with all the files will save time. I will try this way. – AnMSLbR Apr 01 '23 at 10:57
  • @TheodorZoulias I use .NET 6 – AnMSLbR Apr 01 '23 at 10:57
  • @shingo At first glance it looks like the bottleneck is adding files one by one to the archive, but it may be possible to save time at the download stage as well. I'll take a look. – AnMSLbR Apr 01 '23 at 11:06

1 Answer

My suggestion is to parallelize the downloading of the images with the .NET 6 API Parallel.ForEachAsync, and synchronize the interaction with the zip file using a SemaphoreSlim(1, 1):

SemaphoreSlim semaphore = new(1, 1);

ParallelOptions options = new() { MaxDegreeOfParallelism = 5 };

IEnumerable<string> lines = File.ReadLines("UrlList.txt"); // Open the file

await Parallel.ForEachAsync(lines, options, async (line, ct) =>
{
    // Download the image in parallel.
    using Stream responseStream = await _httpClient.GetStreamAsync(line, ct);
    using MemoryStream buffer = new();
    await responseStream.CopyToAsync(buffer, ct);
    buffer.Position = 0;

    // Store it in the zip file sequentially.
    await semaphore.WaitAsync(ct);
    try
    {
        using FileStream zipStream = new FileStream("images.zip", FileMode.OpenOrCreate);
        using ZipArchive zip = new(zipStream, ZipArchiveMode.Update);
        string fileName = Path.GetFileName(new Uri(line).LocalPath); // entry name derived from the URL
        ZipArchiveEntry file = zip.CreateEntry(fileName);
        using Stream fileStream = file.Open();
        await buffer.CopyToAsync(fileStream, ct);
    }
    finally { semaphore.Release(); }
});

Finding the optimal value for the MaxDegreeOfParallelism configuration might require some experimentation.

The images will not be stored in the zip file in exactly the same order as in the "UrlList.txt" file. In case this is a problem, you will have to use a different approach than the Parallel.ForEachAsync method, such as TPL Dataflow or PLINQ.
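If preserving the input order matters, one simple alternative (a sketch of my own, not benchmarked) is to start the downloads as tasks in file order, throttled by a semaphore, and then await and write each result sequentially:

```csharp
using System.IO.Compression;

// Sketch: parallel downloads, sequential writes in original file order.
// Assumes _httpClient exists, as in the answer above.
SemaphoreSlim throttle = new(5, 5); // limits concurrent downloads

var downloads = File.ReadLines("UrlList.txt").Select(async url =>
{
    await throttle.WaitAsync();
    try { return (url, bytes: await _httpClient.GetByteArrayAsync(url)); }
    finally { throttle.Release(); }
}).ToList(); // ToList starts all the tasks eagerly

using var zipStream = new FileStream("images.zip", FileMode.Create);
using var zip = new ZipArchive(zipStream, ZipArchiveMode.Create);
foreach (var task in downloads) // iterate in file order
{
    var (url, bytes) = await task;
    var entry = zip.CreateEntry(Path.GetFileName(new Uri(url).LocalPath));
    using var entryStream = entry.Open();
    await entryStream.WriteAsync(bytes);
}
```

Beware that this buffers every downloaded image that hasn't been written yet, which for hundreds of thousands of URLs can consume a lot of memory; at that scale a bounded pipeline like TPL Dataflow is more appropriate.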

    Thanks, good solution, it looks like it reduces the time spent by ~3.5 times with MaxDegreeOfParallelism = 10. – AnMSLbR Apr 01 '23 at 21:06
  • @AnMSLbR you might be able to optimize it further by moving the creation of the `zipStream` and `zip` variables outside of the parallel loop. I assume that opening and closing the file has some overhead, so opening it just once should be faster. – Theodor Zoulias Apr 02 '23 at 07:38
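The optimization suggested in the last comment could be sketched like this (an untested variation: the archive is opened once in ZipArchiveMode.Create, which writes forward-only, so only the entry creation needs to stay inside the semaphore):

```csharp
using System.IO.Compression;

SemaphoreSlim semaphore = new(1, 1);
ParallelOptions options = new() { MaxDegreeOfParallelism = 10 };

// Open the archive once, outside the parallel loop.
using FileStream zipStream = new("images.zip", FileMode.Create);
using ZipArchive zip = new(zipStream, ZipArchiveMode.Create);

await Parallel.ForEachAsync(File.ReadLines("UrlList.txt"), options, async (line, ct) =>
{
    // Download the image in parallel, buffering it fully in memory.
    using Stream responseStream = await _httpClient.GetStreamAsync(line, ct);
    using MemoryStream buffer = new();
    await responseStream.CopyToAsync(buffer, ct);
    buffer.Position = 0;

    // Only one thread at a time may touch the shared ZipArchive.
    await semaphore.WaitAsync(ct);
    try
    {
        ZipArchiveEntry entry = zip.CreateEntry(Path.GetFileName(new Uri(line).LocalPath));
        using Stream entryStream = entry.Open();
        await buffer.CopyToAsync(entryStream, ct);
    }
    finally { semaphore.Release(); }
});
```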