Fastest way to read many inputs in PyPy3 and what is BytesIO doing here?

Question

Recently I was working on a problem that required me to read a very large number of lines of numbers (around 500,000).

Early on, I found that using input() was way too slow. Using stdin.readline() was much better. However, it still was not fast enough. I found that using the following code:

import io, os
input = io.BytesIO(os.read(0,os.fstat(0).st_size)).readline

and then calling input() as usual improved the runtime. However, I don't actually understand how this code works. According to the documentation for os.read, the 0 in os.read(0, os.fstat(0).st_size) identifies the file being read. Which file does 0 refer to? Also, fstat returns the status of a file, but here its result is apparently used as the maximum number of bytes to read?

The code works but I want to understand what it is doing and why it is faster. Any help is appreciated.

Amadan · Accepted Answer · 2020-03-09T04:26:40.223

0 is the file descriptor for standard input. os.fstat(0).st_size reports the size in bytes of whatever standard input is attached to; when input is redirected from a file, for example, that is the size of the whole input. os.read(0, ...) then reads up to that many bytes in a single bulk call, again from standard input, producing a bytestring.

(As an additional note, 1 is the file descriptor of standard output, and 2 is standard error.)
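To make the pieces easier to see, here is the one-liner from the question spelled out step by step. This is only a sketch: make_fast_readline is a hypothetical helper name, and it assumes the descriptor refers to something with a meaningful size, such as a regular file redirected into standard input. os.read may also return fewer bytes than requested, which this sketch does not handle.

```python
import io
import os


def make_fast_readline(fd=0):
    """Return a readline function over the bytes behind `fd`.

    Hypothetical helper; it mirrors the one-liner from the question.
    Assumes `fd` refers to a regular file (for example, redirected stdin).
    """
    size = os.fstat(fd).st_size        # size in bytes, as reported by the OS
    data = os.read(fd, size)           # one bulk read of at most `size` bytes
    return io.BytesIO(data).readline   # in-memory stream; readline returns bytes


# Typical use in a solution file (0 = standard input):
#     input = make_fast_readline(0)
#     n = int(input())                 # int() accepts bytes such as b"42\n"
```

Note that each call returns a bytes object including the trailing newline, so anything that needs text has to call .decode() on it first.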

Here's a demo:

echo "five" | python3 -c "import os; print(os.stat(0).st_size)"
# => 5

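Python found four single-byte characters and a newline on standard input, and reported five bytes waiting to be read. (What st_size reports for a pipe like this one is platform-dependent, so the number may differ on your system.)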
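The same count can be reproduced without a shell by using a regular file, where st_size is simply the file's length. A minimal sketch, with reported_size as a hypothetical helper name:

```python
import os
import tempfile


def reported_size(payload: bytes) -> int:
    """Write `payload` to a temporary file and return the st_size of its descriptor."""
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()                      # make sure the bytes have reached the file
        return os.fstat(f.fileno()).st_size


# "five" plus the newline that echo appends
size = reported_size(b"five\n")
```

Here size counts the newline as one byte alongside the four letters.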

Bytestrings are not very convenient to work with if you want text (for one thing, they don't really understand the concept of "lines"), so BytesIO wraps the passed bytestring in a fake input stream, allowing you to call readline on it. I am not 100% sure why this is faster, but my guesses are:

- A normal read is likely done character-wise, so that it can detect a line break and stop without reading too much; a bulk read is more efficient, and finding newlines after the fact in memory is pretty fast.
- No encoding processing is done this way.
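If you want to check these guesses on your own interpreter, a rough timing harness could look like the sketch below. All function names are hypothetical, both variants read from memory (so this only isolates the decoding and line-splitting cost, not the system calls), and the timings will vary between CPython and PyPy, so no result is claimed here.

```python
import io
import time


def make_sample(n: int) -> bytes:
    """Build n lines, each holding one integer, as a single bytestring."""
    return b"".join(b"%d\n" % i for i in range(n))


def sum_lines_bytes(data: bytes) -> int:
    """Sum one integer per line using BytesIO.readline (no decoding step)."""
    readline = io.BytesIO(data).readline
    total = 0
    for _ in range(data.count(b"\n")):
        total += int(readline())       # int() parses bytes directly
    return total


def sum_lines_text(data: bytes) -> int:
    """Sum one integer per line through a decoding text layer."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
    total = 0
    for _ in range(data.count(b"\n")):
        total += int(stream.readline())
    return total


def time_call(fn, data):
    """Return (result, elapsed seconds) for a single call of fn(data)."""
    start = time.perf_counter()
    result = fn(data)
    return result, time.perf_counter() - start
```

For example, time_call(sum_lines_bytes, make_sample(500_000)) and time_call(sum_lines_text, make_sample(500_000)) give two timings you can compare directly.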
