
According to this answer, the call len(s) has a complexity of O(1). Then why is calling it on a downloaded 27 kB file so much slower than on a 1 kB file?

27 kB

>>> timeit.timeit('x = len(r.text)', 'from requests import get; r = get("https://cdn.discordapp.com/attachments/280190011918254081/293010649754370048/Journal.170203183244.01.log")', number = 20)
5.78126864130499

1 kB

>>> timeit.timeit('x = len(r.text)', 'from requests import get; r = get("https://cdn.discordapp.com/attachments/280190011918254081/293016636288663562/Journal.170109120508.01.log")', number = 20)
0.00036539355403419904

The problem is that this example ran on my dev machine, which is a normal work PC. The machine the code is supposed to run on is a Raspberry Pi, which is orders of magnitude slower.

Mystery
  • Because you're downloading more data, which takes more time? What did you expect? – spectras Mar 19 '17 at 13:51
  • @spectras: they are not, in fact, timing the download. – Martijn Pieters Mar 19 '17 at 13:52
  • @MartijnPieters> 6s to decode a total of 500kb? Seems very unlikely to me. Also, if I remember well, response's content is a lazy property, not just with decoding it but also with actually reading it from the socket in the first place. – spectras Mar 19 '17 at 14:03
  • @spectras: no, that's the total time for 20 repetitions. – Martijn Pieters Mar 19 '17 at 14:04
  • @MartijnPieters> 20 repetitions of 27kb, that is a total of 540kb – spectras Mar 19 '17 at 14:06
  • @spectras: it's easy to reproduce (3.6 secs for me). The second argument to `timeit()` is the setup, not the timed test. – Martijn Pieters Mar 19 '17 at 14:07
  • @spectras: the time includes the chardet library being used to guess the encoding. – Martijn Pieters Mar 19 '17 at 14:09
  • @MartijnPieters> alright, I tested with a charset defined, so it was much, much faster. I wrongly assumed `text` would be lazy all the way to the download, so it would end up blocking inside the timed test. (basically what would happen with stream=True). – spectras Mar 19 '17 at 14:15

1 Answer


Try assigning r.text to a local variable during your setup phase. Response.text is a lazy property, not a plain attribute: every access decodes the internally cached bytes to str, and because the response headers declare no charset, chardet is run first to guess an encoding. So you're timing that decoding work on each repetition, not just the len call.

Hat tip to Martijn Pieters for the precise references!
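
For example, something along these lines (the URL is the 27 kB one from the question, and the local name text is just illustrative):

import timeit

setup = """
from requests import get
r = get("https://cdn.discordapp.com/attachments/280190011918254081/293010649754370048/Journal.170203183244.01.log")
text = r.text  # decoding (and any encoding guess) happens once, during setup
"""

# Only the len() call is inside the timed statement now.
print(timeit.timeit('x = len(text)', setup, number=20))

With this arrangement the timings for the 27 kB and 1 kB files should both collapse to the cost of a plain len on a str.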

ShadowRanger
  • It's decoding the data from bytes to Unicode each time. – Martijn Pieters Mar 19 '17 at 13:53
  • @MartijnPieters: Ah, I'll add that to the answer. Thanks! I was looking for the code to make a more complete answer. – ShadowRanger Mar 19 '17 at 13:55
  • See http://docs.python-requests.org/en/master/user/quickstart/#response-content and http://docs.python-requests.org/en/master/api/#requests.Response.text – Martijn Pieters Mar 19 '17 at 13:57
  • Are you sure this can explain the timing disparity? I would expect a 27kB file to decode to Unicode approximately 27 times slower than a 1kB file. If I'm counting the zeros right, the OP seems to be measuring **4 orders of magnitude** of timing difference. What is going on there? – user4815162342 Mar 19 '17 at 14:01
  • @user4815162342: yes, because there is no codec in the headers so chardet is used to do a statistical analysis and guess an encoding to use. Each time. – Martijn Pieters Mar 19 '17 at 14:10
  • @MartijnPieters One would expect the time it takes to perform the analysis to be linear in text size, though. (What the OP measured seems cubic.) Taking into account the time to set up the analysis, it would even be reasonable to measure *slower* performance (per unit of text) on smaller files. – user4815162342 Mar 19 '17 at 14:18
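
As an aside, here is a minimal sketch of the variant spectras mentions above, timing with "a charset defined": setting Response.encoding before touching r.text makes requests decode with that codec instead of running chardet on every access. The choice of utf-8 is an assumption about these log files, not something stated in the thread.

import timeit

setup = """
from requests import get
r = get("https://cdn.discordapp.com/attachments/280190011918254081/293010649754370048/Journal.170203183244.01.log")
r.encoding = "utf-8"  # assumed encoding; with this set, requests skips the chardet guess
"""

# r.text is still re-decoded on every access, but plain decoding of 27 kB
# is cheap compared to running chardet each time.
print(timeit.timeit('x = len(r.text)', setup, number=20))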