36

Is there any guideline on selecting chunk size?

I tried different chunk sizes, but none of them gives a download speed comparable to a browser or wget download.

Here is a snapshot of my code:

 r = requests.get(url, headers=headers, stream=True)
 total_length = r.headers.get('content-length')
 if total_length is not None:  # only stream when a content-length header is present
     for chunk in r.iter_content(1024):
         f.write(chunk)

Any help would be appreciated.

Edit: I tried networks with different speeds, and I am able to achieve a higher speed than on my home network. But when I tested wget and a browser, the speed was still not comparable.

Thanks

smci

4 Answers

16

You will lose time switching between reads and writes, and the limit of the chunk size is AFAIK only the limit of what you can store in memory. So as long as you aren't very concerned about keeping memory usage down, go ahead and specify a large chunk size, such as 1 MB (e.g. 1024 * 1024) or even 10 MB. Chunk sizes in the 1024 byte range (or even smaller, as it sounds like you've tested much smaller sizes) will slow the process down substantially.

For a very heavy-duty situation where you want to get as much performance as possible out of your code, you could look at the io module for buffering etc. But I think increasing the chunk size by a factor of 1000 or 10000 or so will probably get you most of the way there.
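
For illustration, a minimal sketch of that approach (the URL and output filename are placeholders, not from the question):

    import requests

    url = "https://example.com/bigfile.bin"   # placeholder URL
    chunk_size = 1024 * 1024                  # 1 MB chunks instead of 1 KB

    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open("bigfile.bin", "wb") as f:
        # iter_content yields at most chunk_size bytes per iteration,
        # so the loop makes fewer, larger writes to disk.
        for chunk in r.iter_content(chunk_size=chunk_size):
            f.write(chunk)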

Andrew Gorcester
  • As I mentioned, I tried up to 512*1024 and still saw no improvement in speed, so I am guessing it has something to do with the requests module or Windows prioritization of reads. I have used youtube-dl, a program written in Python, and its speed is much higher (note it is a compiled program). –  May 01 '14 at 02:03
  • Oh, I misread. I thought you had tried literally 1, 2, 3, 4 and not 1024 * those numbers. Hmm, in that case I'm not sure. I would start by timing a read in one gigantic chunk (say 10 MB or larger) to an in-memory temporary file and seeing how that speed compares; if that works, then write it to disk and compare the results (see the sketch after these comments). – Andrew Gorcester May 01 '14 at 02:15
  • Switching time is not an issue. – Marcin May 01 '14 at 14:35
  • Speed: my code --> 400-500 KB; wget and others roughly 1-1.2 MB –  May 02 '14 at 00:18
  • Hmm, that's an awfully big gap. If the chunk size is appropriately large, then I don't have confidence that either that or the lack of buffered IO is the problem. – Andrew Gorcester May 02 '14 at 00:57
  • I found that if I raise the chunk size, the speed is improved. – j413254 Nov 05 '20 at 17:06
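
As a rough sketch of the benchmarking approach suggested in the comments above (the URL and filename are placeholders, not from the thread), you could time the network read and the disk write separately:

    import time
    import requests

    url = "https://example.com/bigfile.bin"  # placeholder URL

    # Time the network read alone, pulling very large chunks into memory.
    start = time.time()
    r = requests.get(url, stream=True)
    data = bytearray()
    for chunk in r.iter_content(chunk_size=10 * 1024 * 1024):  # 10 MB chunks
        data.extend(chunk)
    print("network read only: %.1f s" % (time.time() - start))

    # Then time the disk write alone, from the data already in memory.
    start = time.time()
    with open("bigfile.bin", "wb") as f:
        f.write(data)
    print("disk write only: %.1f s" % (time.time() - start))

If the first timing already matches wget, the bottleneck is in how the chunks are written rather than in requests itself.
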
8

I am probably too late... But the problem is with how you are requesting the objects (files). You are using non-persistent HTTP connections, which means that for each file you incur two round-trip times plus the transmission time of the file. This basically adds two ping times per file; with an average ping of, let's say, 330 ms, that's 660 ms for each file. With just ten files this is already around 6-8 seconds. The solution is to use a session instead, which establishes a persistent HTTP connection for all your requests. Also, it's easier to use the raise_for_status() method than to check whether the content is empty:

import requests

session = requests.Session()
# url, headers and f are as in the question; the session reuses one connection
r = session.get(url, headers=headers, stream=True)
r.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
for chunk in r.iter_content(1024):
    f.write(chunk)
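
The gain shows up when many files share the session. A minimal sketch, assuming a hypothetical list of URLs (none of these names come from the answer):

    import os
    import requests

    session = requests.Session()  # one persistent connection, reused for every file
    urls = [
        "https://example.com/file1.bin",  # placeholder URLs
        "https://example.com/file2.bin",
    ]

    for url in urls:
        r = session.get(url, stream=True)
        r.raise_for_status()
        filename = os.path.basename(url)  # illustrative way to pick a local name
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

Because the TCP connection is kept alive, only the first request pays the connection setup cost.
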
brayo
2

Based on your code, it's likely that the problem is that you are not using buffered IO. If you do that, then each call to write should be very short (because it's buffered and threaded), and you can take pretty big chunks from the wire (3-10 MB).
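
For the buffering half of that suggestion, here is a minimal sketch, assuming a large user-space write buffer via the buffering argument to open() (the URL and sizes are placeholders; this does not by itself overlap reads and writes):

    import requests

    url = "https://example.com/bigfile.bin"  # placeholder URL

    r = requests.get(url, stream=True)
    r.raise_for_status()

    # buffering= sets the size of Python's user-space write buffer, so small
    # writes are coalesced into fewer, larger writes to the OS.
    with open("bigfile.bin", "wb", buffering=8 * 1024 * 1024) as f:
        for chunk in r.iter_content(chunk_size=4 * 1024 * 1024):  # within the 3-10 MB range
            f.write(chunk)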

Marcin
  • I am not the downvoter, but I don't think buffered IO is necessary in order to take big chunks from the wire. If anything, the opposite might be more true -- it's more important to take big chunks if you do NOT have buffered IO. Buffered IO will likely improve performance on the margin, but I think the OP has a much bigger problem than buffered vs. unbuffered IO; namely, the chunk size itself. – Andrew Gorcester Apr 30 '14 at 20:41
  • @AndrewGorcester You've got this backwards: using unbuffered IO means that the app switches between reading and writing in a single thread. – Marcin May 01 '14 at 00:21
  • I understand that part! The reason I think a large chunk size is useful in that case is that the code will only context switch between reading and writing a few times instead of tens of thousands, and the disk access will be one long write instead of many small ones. Loading a very large chunk into memory before writing is de facto buffering to memory; the write won't be concurrent but at least it will be at full speed, from data in memory, and with minimal context switches. – Andrew Gorcester May 01 '14 at 00:58
  • @AndrewGorcester Well, no. It's not the switching that's the issue. It's the fact that they're in series. In such a case, reading from the remote connection may take a long time; indeed it might take longer than the write. Fundamentally, using synchronous IO is a performance hit for this application (see the sketch after these comments). – Marcin May 01 '14 at 14:07
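
To make the series-vs-overlap point concrete, here is a minimal sketch (not from either commenter; the URL and names are placeholders) that overlaps network reads and disk writes by handing chunks to a separate writer thread through a bounded queue:

    import queue
    import threading
    import requests

    url = "https://example.com/bigfile.bin"   # placeholder URL
    chunks = queue.Queue(maxsize=8)           # bounded, so memory use stays modest

    def writer(path):
        # Drain chunks from the queue and write them until the None sentinel arrives.
        with open(path, "wb") as f:
            while True:
                chunk = chunks.get()
                if chunk is None:
                    break
                f.write(chunk)

    t = threading.Thread(target=writer, args=("bigfile.bin",))
    t.start()

    r = requests.get(url, stream=True)
    r.raise_for_status()
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        chunks.put(chunk)   # the network read continues while the writer thread works
    chunks.put(None)        # sentinel: no more data
    t.join()

Whether this beats a plain loop with large chunks depends on whether the disk or the network is the slower side.
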
-4

You can change the chunk-size settings as follows:

~/apps/erpnext/htdocs/frappe-bench/sites/assets/js$ vi desk.min.js

Step 1: at line no. 2078,

chunk_size = _ref$chunk_size === undefined ? 24576 : _ref$chunk_size,

Increase it as required, e.g.:

chunk_size = _ref$chunk_size === undefined ? 2457600 : _ref$chunk_size,

Step 2: at line no. 8993,

var file_not_big_enough = fileobj.size <= 24576;

Increase it as required, e.g.:

var file_not_big_enough = fileobj.size <= 2457600; 
Phani Rithvij