33

I have been trying to understand the tradeoff between read and seek. For small "jumps", reading unneeded data is faster than skipping it with seek.

While timing different read/seek chunk sizes to find the tipping point, I came across an odd phenomenon: read(1) is about 20 times slower than read(2), read(3), etc. The effect is the same for different read methods, e.g. read() and readinto().
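
Here is a stripped-down sketch of the kind of comparison being timed (the file name and repetition count are just placeholders, and it assumes the file is large enough that EOF is never reached during the loop):

from timeit import timeit

with open('test.bin', 'rb') as f:  # any sufficiently large binary file
    for n in (1, 2, 3, 4):
        t = timeit('f.read(n)', globals={'f': f, 'n': n}, number=100000)
        print('read({}): {:.3f} s for 100000 calls'.format(n, t))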

Why is this the case?

In the timing results below, look for the following line, about two-thirds of the way through:

2 x buffered 1 byte readinto bytearray

Environment:

Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul  5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]

Timing results:

Non-cachable binary data ingestion (file object blk_size = 8192):
- 2 x buffered 0 byte readinto bytearray:
      robust mean: 6.01 µs +/- 377 ns
      min: 3.59 µs
- Buffered 0 byte seek followed by 0 byte readinto:
      robust mean: 9.31 µs +/- 506 ns
      min: 6.16 µs
- 2 x buffered 4 byte readinto bytearray:
      robust mean: 14.4 µs +/- 6.82 µs
      min: 2.57 µs
- 2 x buffered 7 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.76 µs
      min: 3.08 µs
- 2 x buffered 2 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.77 µs
      min: 3.08 µs
- 2 x buffered 5 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.76 µs
      min: 3.08 µs
- 2 x buffered 3 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.73 µs
      min: 2.57 µs
- 2 x buffered 49 byte readinto bytearray:
      robust mean: 14.5 µs +/- 6.72 µs
      min: 2.57 µs
- 2 x buffered 6 byte readinto bytearray:
      robust mean: 14.6 µs +/- 6.76 µs
      min: 3.08 µs
- 2 x buffered 343 byte readinto bytearray:
      robust mean: 15.3 µs +/- 6.43 µs
      min: 3.08 µs
- 2 x buffered 2401 byte readinto bytearray:
      robust mean: 138 µs +/- 247 µs
      min: 4.11 µs
- Buffered 7 byte seek followed by 7 byte readinto:
      robust mean: 278 µs +/- 333 µs
      min: 15.4 µs
- Buffered 3 byte seek followed by 3 byte readinto:
      robust mean: 279 µs +/- 333 µs
      min: 14.9 µs
- Buffered 1 byte seek followed by 1 byte readinto:
      robust mean: 279 µs +/- 334 µs
      min: 15.4 µs
- Buffered 2 byte seek followed by 2 byte readinto:
      robust mean: 279 µs +/- 334 µs
      min: 15.4 µs
- Buffered 4 byte seek followed by 4 byte readinto:
      robust mean: 279 µs +/- 334 µs
      min: 15.4 µs
- Buffered 49 byte seek followed by 49 byte readinto:
      robust mean: 281 µs +/- 336 µs
      min: 14.9 µs
- Buffered 6 byte seek followed by 6 byte readinto:
      robust mean: 281 µs +/- 337 µs
      min: 15.4 µs
- 2 x buffered 1 byte readinto bytearray:
      robust mean: 282 µs +/- 334 µs
      min: 17.5 µs
- Buffered 5 byte seek followed by 5 byte readinto:
      robust mean: 282 µs +/- 338 µs
      min: 15.4 µs
- Buffered 343 byte seek followed by 343 byte readinto:
      robust mean: 283 µs +/- 340 µs
      min: 15.4 µs
- Buffered 2401 byte seek followed by 2401 byte readinto:
      robust mean: 309 µs +/- 373 µs
      min: 15.4 µs
- Buffered 16807 byte seek followed by 16807 byte readinto:
      robust mean: 325 µs +/- 423 µs
      min: 15.4 µs
- 2 x buffered 16807 byte readinto bytearray:
      robust mean: 457 µs +/- 558 µs
      min: 16.9 µs
- Buffered 117649 byte seek followed by 117649 byte readinto:
      robust mean: 851 µs +/- 1.08 ms
      min: 15.9 µs
- 2 x buffered 117649 byte readinto bytearray:
      robust mean: 1.29 ms +/- 1.63 ms
      min: 18 µs

Benchmarking code:

from _utils import BenchmarkResults

from timeit import timeit, repeat
import gc
import os
from contextlib import suppress
from math import floor
from random import randint

### Configuration

FILE_NAME = 'test.bin'
r = 5000
n = 100

reps = 1

chunk_sizes = list(range(7)) + [7**x for x in range(1,7)]

results = BenchmarkResults(description = 'Non-cachable binary data ingestion')


### Setup

FILE_SIZE = int(100e6)

# remove left over test file
with suppress(FileNotFoundError):
    os.unlink(FILE_NAME)

# determine how large a file needs to be to not fit in memory
gc.collect()
try:
    while True:
        data = bytearray(FILE_SIZE)
        del data
        FILE_SIZE *= 2
        gc.collect()
except MemoryError:
    FILE_SIZE *= 2
    print('Using file with {} GB'.format(FILE_SIZE / 1024**3))

# check enough data in file
required_size = sum(chunk_sizes)*2*2*reps*r
print('File size used: {} GB'.format(required_size / 1024**3))
assert required_size <= FILE_SIZE


# create test file
with open(FILE_NAME, 'wb') as file:
    buffer_size = int(10e6)
    data = bytearray(buffer_size)
    for i in range(int(FILE_SIZE / buffer_size)):
        file.write(data)

# read file once to try to force it into system cache as much as possible
from io import DEFAULT_BUFFER_SIZE
buffer_size = 10*DEFAULT_BUFFER_SIZE
buffer = bytearray(buffer_size)
with open(FILE_NAME, 'rb') as file:
    bytes_read = True
    while bytes_read:
        bytes_read = file.readinto(buffer)
    blk_size = file.raw._blksize

results.description += ' (file object blk_size = {})'.format(blk_size)

file = open(FILE_NAME, 'rb')

### Benchmarks

setup = \
"""
# random seek to avoid advantageous starting position biasing results
file.seek(randint(0, file.raw._blksize), 1)
"""

read_read = \
"""
file.read(chunk_size)
file.read(chunk_size)
"""

seek_seek = \
"""
file.seek(buffer_size, 1)
file.seek(buffer_size, 1)
"""

seek_read = \
"""
file.seek(buffer_size, 1)
file.read(chunk_size)
"""

read_read_timings = {}
seek_seek_timings = {}
seek_read_timings = {}
for chunk_size in chunk_sizes:
    read_read_timings[chunk_size] = []
    seek_seek_timings[chunk_size] = []
    seek_read_timings[chunk_size] = []

for j in range(r):
    #file.seek(0)
    for chunk_size in chunk_sizes:
        buffer = bytearray(chunk_size)
        read_read_timings[chunk_size].append(timeit(read_read, setup, number=reps, globals=globals()))
        #seek_seek_timings[chunk_size].append(timeit(seek_seek, setup, number=reps, globals=globals()))
        seek_read_timings[chunk_size].append(timeit(seek_read, setup, number=reps, globals=globals()))

for chunk_size in chunk_sizes:
    results['2 x buffered {} byte readinto bytearray'.format(chunk_size)] = read_read_timings[chunk_size]
    #results['2 x buffered {} byte seek'.format(chunk_size)] = seek_seek_timings[chunk_size]
    results['Buffered {} byte seek followed by {} byte readinto'.format(chunk_size, chunk_size)] = seek_read_timings[chunk_size]


### Cleanup
file.close()
os.unlink(FILE_NAME)

results.show()
results.save()

Edit 2020-02-24:

@finefoot requested the _utils package in order to be able to run the code above.

from collections import OrderedDict
from math import ceil
from statistics import mean, stdev
from contextlib import suppress
import os
import inspect

class BenchmarkResults(OrderedDict):
    def __init__(self, *args, description='Benchmark Description', **kwArgs):
        self.description = description
        return super(BenchmarkResults, self).__init__(*args, **kwArgs)

    def __repr__(self):
        """Shows the results for the benchmarks in order of ascending duration"""
        characteristic_durations = []
        for name, timings in self.items():
            try:
                characteristic_durations.append(_robust_stats(timings)[0])
            except ValueError:
                if len(timings) > 1:
                    characteristic_durations.append(mean(timings))
                else:
                    characteristic_durations.append(timings[0])
        indx = _argsort(characteristic_durations)
        repr = '{}:\n'.format(self.description)
        items = list(self.items())
        for i in indx:
            name, timings = items[i]
            repr += '- {}:\n'.format(name)
            try:
                stats = _robust_stats(timings)
                repr += '      robust mean: {} +/- {}\n'.format(_units(stats[0]), _units(stats[1]))
            except ValueError:
                repr += '      timings: {}\n'.format(', '.join(map(_units, timings)))
            if len(timings) > 1:
                repr += '      min: {}\n'.format(_units(min(timings)))
        return repr

    def show(self):
        print(self)

    def save(self):
        caller = inspect.stack()[1]
        filename = os.path.splitext(caller.filename)[0] + '.log'
        with open(filename, 'w') as logfile:
            logfile.write(repr(self))


def _units(seconds, significant_figures=3):
    fmt = '{:.%sg} {}' % significant_figures
    # check from the largest unit downwards so every branch is reachable
    if seconds > 3600:
        return fmt.format(seconds/3600, 'hrs')
    elif seconds > 60:
        return fmt.format(seconds/60, 'min')
    elif seconds > 1:
        return fmt.format(seconds, 's')
    elif seconds > 1e-3:
        return fmt.format(seconds*1e3, 'ms')
    elif seconds > 1e-6:
        return fmt.format(seconds*1e6, 'µs')
    else:
        return fmt.format(seconds*1e9, 'ns')

def _robust_stats(timings, fraction_to_use=0.8):
    if len(timings) < 5:
        raise ValueError('To calculate a robust mean, you need at least 5 timing results')
    elts_to_prune = int(len(timings) * (1 - fraction_to_use))
    # prune at least the highest and the lowest result
    elts_to_prune = elts_to_prune if elts_to_prune > 2 else 2
    # round to an even number --> symmetric pruning
    offset = ceil(elts_to_prune / 2)

    # sort the timings
    timings.sort()
    # prune the required fraction of the elements
    timings = timings[offset:-offset]
    return mean(timings), stdev(timings)

def _argsort(seq):
    # http://stackoverflow.com/questions/3071415/efficient-method-to-calculate-the-rank-vector-of-a-list-in-python
    return sorted(range(len(seq)), key=seq.__getitem__)

if __name__ == '__main__':
    pass
ARF
  • 2
    For small strings, the effects depend on the _**pointer size of the system**_, and the **size of the `Py_UNICODE/wchar_t`** type. https://www.python.org/dev/peps/pep-0393/#performance – veganaiZe Nov 26 '18 at 19:09
  • 1
    @veganaiZe Could you expand a bit on the reason? I did not quite follow your point. This question seems to get quite a bit of traffic which suggests I am not the only one interested in your explanation. – ARF Aug 10 '19 at 10:49
  • This is quite a lot of code for a minimal example. Is the first part with the caching necessary to reproduce the issue? (Didn't look that closely at it. Also `_utils.BenchmarkResults` is not part of the stdlib. Which package is that?) Do you get the same results if you randomly mix up the order of the tests? – finefoot Aug 30 '19 at 15:54
  • @finefoot Sorry no. Does the problem still exist with current python? For my original application I had to read the entire file anyway, so I just read larger blocks and indexed into them. – ARF Feb 23 '20 at 17:24
  • @finefoot Sorry, I had overlooked that. I did manage to find it on my hard drive and appended it to the OP. If you run the benchmark, let me know if you can reproduce the effect. - Just curious if this is still a problem. – ARF Feb 24 '20 at 10:11

3 Answers

2

I was able to reproduce the issue with your code. However, I noticed the following: can you verify that the issue disappears if you replace

file.seek(randint(0, file.raw._blksize), 1)

with

file.seek(randint(0, file.raw._blksize), 0)

in setup? I think you might just run out of data at some point while reading 1 byte. Reading 2 bytes, 3 bytes and so on then won't have any data left to read, which is why they are so much faster.
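
A quick way to see what running out of data looks like (a minimal sketch, assuming the test.bin file from the question exists): once the buffered file object is at or past EOF, read() returns immediately with nothing.

with open('test.bin', 'rb') as f:
    f.seek(0, 2)         # whence=2: jump to the end of the file
    print(f.read(2))     # b'' -- nothing left to read, returns almost instantly
    f.seek(8192, 1)      # seeking past EOF is allowed
    print(f.read(2))     # still b''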

finefoot
  • I will take a look at it when I have some time. But I have some doubts regarding your analysis. If you were correct, reading 1-byte at a time would be slowest, then 2 bytes, 3 bytes, 4 bytes, etc. That is however not what I am seeing. On my machine 4 bytes at a time is slowest, then 7 bytes, 2 bytes, 4 bytes etc. - How does this fit with your analysis? – ARF Feb 24 '20 at 19:30
    If I change the `whence` argument to `0`, all small bytes 1, 2, 3, ... are approximately the same. For 2401 and higher, there's a noticeable increase. "If you were correct, reading 1-byte at a time would be slowest, then 2 bytes, 3 bytes, 4 bytes" I meant to say that 1 byte is slow (because there's still data to read) and somewhere around 2 bytes, the `seek` has reached the end of the file and 2, 3, 4 etc. are only reading zeros and are therefore the same duration. (Which is what the log in your question above shows, right?) – finefoot Feb 24 '20 at 21:13
  • After looking at it again, I believe you are right that my test procedure is flawed because it can run out of file to read prematurely. Using whence=0 is not good either, though: 1. you may be going back and "reading" data that is already buffered, 2. you may still run out of file to read since it can sometimes seek to the end of the file. Better might be something along the lines of file.seek(randint(0, file.raw._blksize/r/n/some_margin), 1). What do you think? – ARF Feb 25 '20 at 18:54
  • I really don't have much experience with how buffered vs. un-buffered IO behaves. I just stumbled upon your post and fiddled around a little bit, noticing the difference the change of `whence` makes. ;) I'd probably try to further clean up and reduce the code as a first step. You don't need more than 0, 1 and 2 chunk sizes to show the issue. This should simplify things radically. Also, something I have noticed: If you swap 1 and 2 in `chunk_sizes`, so it's now `[0, 2, 1, ...]`, then 2 is the one which takes longer. – finefoot Feb 26 '20 at 14:30
1

Reading from a file handle byte-for-byte will generally be slower than reading in chunks.

In general, every Python read() call corresponds to a C read() call, which issues a system call requesting the next chunk of data. For a 2 kB file read one byte at a time, this means roughly 2000 calls into the kernel, each requiring a function call, a request to the kernel, waiting for the response, and passing that back through the return.

Most notable here is waiting for the response: the system call blocks until your request has been acknowledged and serviced, so you have to wait.

The fewer calls the better, so reading more bytes per call is faster; this is why buffered I/O is in fairly common use.

In Python, buffering is provided by io.BufferedReader or through the buffering keyword argument of open() for files.
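
A rough sketch of where that buffering lives (the file name is just a placeholder): in binary mode, open() already returns an io.BufferedReader by default, while buffering=0 gives the raw, unbuffered FileIO object where every read() is a separate system call.

import io

buffered = open('data.bin', 'rb')              # default: io.BufferedReader
print(type(buffered))                          # <class '_io.BufferedReader'>

raw = open('data.bin', 'rb', buffering=0)      # unbuffered: every read() hits the OS
print(type(raw))                               # <class '_io.FileIO'>

# explicit equivalent of the default, with a chosen buffer size
wrapped = io.BufferedReader(io.FileIO('data.bin', 'r'), buffer_size=8192)

for f in (buffered, raw, wrapped):
    f.close()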

Catamondium
  • 2
    Thanks but I believe you have missed the point of my question. `x` number of `read(1)` calls is (was?) much slower than `x` number of e.g. `read(2)` calls. The number of system calls is identical in both cases! – ARF Aug 27 '19 at 08:00
  • In the tests, was the number of bytes available adjusted to fit that metric? Because if the bytes get exhausted, there's a good chance the implementation locks into an instant return of 0 bytes. – Catamondium Aug 27 '19 at 16:27
0

I have seen similar situations while dealing with Arduinos interfacing with EEPROM. Basically, in order to write to or read from a chip or data structure, you have to send a write/read enable command, send a starting location, and then grab the first character. If you grab multiple bytes, however, most chips will auto-increment their target address registers. Thus, there is some overhead for starting a read/write operation. It's the difference between:

  • Start communications
  • Send read enable
  • Send read command
  • Send address 1
  • Grab data from target 1
  • End communications
  • Start communications
  • Send read enable
  • Send read command
  • Send address 2
  • Grab data from target 2
  • End communications

and

  • Start communications
  • Send read enable
  • Send read command
  • Send address 1
  • Grab data from target 1
  • Grab data from target 2
  • End communications

In terms of machine instructions alone, reading multiple bits/bytes at a time clears a lot of overhead. It's even worse when some chips require you to idle for a few clock cycles after the read/write enable is sent, to let a mechanical process physically move a transistor into place to enable the reading or writing.

Aidan Low
  • 2
    The OP is on an Intel x86 CPU, which definitely has efficient byte load/store instructions (at least for cacheable memory regions). The "Non-cachable" in the question is about *file* caching, not uncacheable memory; the memory reads/writes will be to write-back cacheable memory regions. ([Are there any modern/ancient CPUs / microcontrollers where a cached byte store is actually slower than a word store?](//stackoverflow.com/q/54217528): yes some non-x86 CPUs have slightly slower byte stores. But x86 has no penalty). But this question is using Python, so there's a boatload of software overhead… – Peter Cordes Feb 13 '19 at 17:40