I'm using aiohttp to download large files (~150MB-200MB each).

Currently, for each file, I'm doing:

import aiohttp
import aiofiles

async def download_file(session: aiohttp.ClientSession, url: str, dest: str):
    chunk_size = 16384  # 16 KiB per read
    async with session.get(url) as response:
        async with aiofiles.open(dest, mode="wb") as f:
            # stream the body to disk instead of buffering the whole file in memory
            async for data in response.content.iter_chunked(chunk_size):
                await f.write(data)

I create multiple tasks of this coroutine to achieve concurrency. I'm wondering:

  1. What is the best value for chunk_size?
  2. Is calling iter_chunked(chunk_size) better than just doing data = await response.read() and writing that to disk? In that case, how can I report the download progress?
  3. How many tasks running this coroutine should I create?
  4. Is there a way to download multiple parts of the same file in parallel? Is that something aiohttp already does?
user3599803
  • Does this [answer](https://stackoverflow.com/a/71285322/11832127) help you? – ndclt Sep 28 '22 at 13:33
  • @ndclt not exactly, I'm looking for an answer specific to downloading **large** files. That answer also does not consider the value of `chunk_size` – user3599803 Sep 28 '22 at 14:53

1 Answer

  1. Selection of chunk size depends on how much RAM you want to spend on it. With 4 GB of RAM, a chunk size of 512 MB or 1 GB is okay, but with 1 GB of RAM you probably don't want a 1 GB chunk size. So set your chunk_size according to the available memory (see the first sketch after this list).

  2. You should create as many tasks as the number of files you want to download in parallel. That's entirely up to you and your use case (see the second sketch after this list for one way to cap concurrency).

  3. aiohttp does not internally split the download of a single file into parts. What you could do is send a HEAD request to the server asking for the file's Content-Length, subdivide the file into byte ranges, request each part from the server in parallel, and then merge the parts yourself (a rough sketch follows below).
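
For point 1, here is a minimal sketch of chunked streaming that also reports progress (addressing question 2 from the post), assuming the server sends a Content-Length header; the 1 MiB chunk size and the print call are illustrative choices, not requirements:

import aiohttp
import aiofiles

async def download_with_progress(session: aiohttp.ClientSession, url: str, dest: str):
    chunk_size = 1024 * 1024  # 1 MiB per read; a placeholder, tune to your memory budget
    async with session.get(url) as response:
        total = response.content_length  # None if the server omits Content-Length
        done = 0
        async with aiofiles.open(dest, mode="wb") as f:
            async for data in response.content.iter_chunked(chunk_size):
                await f.write(data)
                done += len(data)
                if total:
                    print(f"{dest}: {done / total:.1%}")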
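
For point 2 (how many tasks), a common pattern is to create one task per file but cap how many downloads run at once with an asyncio.Semaphore. The limit of 4 below is a placeholder, and download_file is the coroutine from the question:

import asyncio
import aiohttp

async def download_all(jobs, limit: int = 4):
    # limit of 4 is a placeholder; tune it to your bandwidth and the server's tolerance
    semaphore = asyncio.Semaphore(limit)

    async def bounded(session, url, dest):
        async with semaphore:  # at most `limit` downloads are in flight at once
            await download_file(session, url, dest)

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(bounded(session, url, dest) for url, dest in jobs))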
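
For point 3, a rough sketch of that split-and-merge approach, assuming the server answers HEAD with a Content-Length and supports HTTP Range requests; the part count of 4 is arbitrary:

import asyncio
import aiohttp
import aiofiles

async def download_in_parts(url: str, dest: str, parts: int = 4):
    async with aiohttp.ClientSession() as session:
        async with session.head(url) as response:
            if response.headers.get("Accept-Ranges") != "bytes":
                raise RuntimeError("server does not advertise range support")
            size = int(response.headers["Content-Length"])

        async def fetch_range(start: int, end: int) -> bytes:
            # Range is inclusive on both ends
            async with session.get(url, headers={"Range": f"bytes={start}-{end}"}) as response:
                return await response.read()

        # split the file into roughly equal byte ranges
        step = size // parts
        ranges = [(i * step, (i + 1) * step - 1) for i in range(parts - 1)]
        ranges.append(((parts - 1) * step, size - 1))  # last range takes the remainder

        # note: each part is buffered in memory here; for very large files,
        # consider writing each part to its own temporary file instead
        chunks = await asyncio.gather(*(fetch_range(s, e) for s, e in ranges))
        async with aiofiles.open(dest, mode="wb") as f:
            for chunk in chunks:
                await f.write(chunk)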

Sören Rifé
  • For #2 (how many tasks?)... in addition to your own load-handling capability I would also take into account the receiving server's ability to respond. If the calls are to different endpoints altogether that's no problem, but if you're calling the same endpoint for the different files you may want to throttle so as to avoid overwhelming their server (and potentially having their firewall block you out). May not be a consideration if you're connecting to an endpoint that you know can handle high volume. – teejay Jun 22 '23 at 20:30