The Python file method readlines() returns all lines in the file as a list, where each line is one item in the list.

As an example, here f.readlines() returns a list.

f = open("file.txt", "r")
print(f.readlines())

How can I implement an equivalent of file.readlines() using Python mmap?

I need to read all lines into a list, as opposed to reading a single line from the file.

This is what I have tried so far based on How to read lines from a mmapped file?.

    lines = []
    with open(path, "rt", encoding="utf-8") as f:
        m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        while True:
            line = m.readline()
            lines.append(line)
            if line == "":
                break
            print(line)
        m.close()

However, this code loops forever instead of stopping at the end of the file.

Exploring
  • May I ask *why* you're doing this? While `mmap` does allow it, to my knowledge there is no benefit to using it for line reading over using the original file object the same way; it doesn't get any of the benefits of `mmap` (rereading data cheaply, easy random access, zero-copy behavior when used with `memoryview`s, etc.). – ShadowRanger Sep 15 '21 at 03:22
  • to optimize file read - I am processing millions of files from a directory. – Exploring Sep 15 '21 at 03:23
  • Note, you should generally just be doing `print(list(f))` instead of `print(f.readlines())`. – juanpa.arrivillaga Sep 15 '21 at 03:25
  • @juanpa.arrivillaga: I was sad they kept the `.readlines()` method when they moved to Python 3. So many people use it unnecessarily (`for line in f.readlines():` drives me nuts), and `list(f)` already covers that use case, so it both encouraged bad code and added yet another way to do something for no reason. Blech. – ShadowRanger Sep 15 '21 at 03:28
  • @Exploring: Yeah, you're welcome to test, but I'm fairly sure `mmap` won't save you a thing if you're just eagerly slurping all the lines from the file sequentially as a bulk read. Especially for the use case here, where you end up doing a ton of work at the Python level, per-line, where the file object would push most of the work to the C layer and do a lot of it in bulk more efficiently. – ShadowRanger Sep 15 '21 at 03:29
  • @ShadowRanger thanks for the pointer. But what's my alternative here to optimize file reads? – Exploring Sep 15 '21 at 03:31
  • @Exploring what exactly are you trying to optimize? – juanpa.arrivillaga Sep 15 '21 at 03:35
  • @ShadowRanger you could use `iter(m.readline, b'')` to push most of that work into the C layer. – juanpa.arrivillaga Sep 15 '21 at 03:36
  • @juanpa.arrivillaga I have to read millions of files from a directory and process them. I have already parallelized the code and file read is the bottleneck at this point. – Exploring Sep 15 '21 at 03:37
  • Does the whole file need to be in memory? Because then I don't think mmap will help you, I/O is your bottleneck – juanpa.arrivillaga Sep 15 '21 at 03:37
  • Yes, the whole file needs to be in memory as I have to analyze the file content. – Exploring Sep 15 '21 at 03:39
  • @juanpa.arrivillaga: Sure. It's still going to be slower though, in the same way using `for line in iter(fileobj.readline, b''):` is slower than `for line in fileobj:`; even pushed to the C layer, there's more per-line overhead. Personally, I'd favor the walrus on modern Python if you needed to do this that way, e.g. `while line := m.readline().decode('utf-8'):` but again, you don't need to do this, and it's still going to be slower than the tools that have been hyperoptimized specifically for iterating a file by line. – ShadowRanger Sep 15 '21 at 03:39
  • @Exploring: What exactly does "analyze the file content" consist of? You could potentially run regexes against the `mmap` object itself (it implements the buffer protocol, so a lot of things that work with arbitrary `bytes`-like objects can use it in zero-copy, page-in on demand mode), and wrapped in `memoryview`, you can prevent accidental copies for more complex manipulation. – ShadowRanger Sep 15 '21 at 03:41
  • @ShadowRanger thanks for your comments. I read two files and then run `difflib` to compute deltas. – Exploring Sep 15 '21 at 03:46
  • "processing millions of files from a directory" - is that millions of files in a single directory? That can't be good for performance. – sj95126 Sep 15 '21 at 04:11
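
The two directions suggested in these comments can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `read_lines_plain` and `count_matches_mmap` and the demo file contents are made up here, and the regex helper assumes a non-empty file, since an empty file cannot be mapped.

```python
import mmap
import os
import re
import tempfile


def read_lines_plain(path):
    """Read every line as str with the ordinary file object."""
    with open(path, "r", encoding="utf-8") as f:
        return list(f)  # same result as f.readlines()


def count_matches_mmap(path, pattern):
    """Count regex matches against the mapped bytes (pattern must be bytes)."""
    with open(path, "rb") as f:
        # access=mmap.ACCESS_READ is the portable spelling of a read-only map.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # mmap supports the buffer protocol, so re can scan it directly.
            return len(re.findall(pattern, m))


# Demo on a throwaway file (contents are made up for illustration).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "w", encoding="utf-8", newline="") as out:
    out.write("alpha\nbeta\nalpha beta\n")

plain_lines = read_lines_plain(demo_path)
alpha_count = count_matches_mmap(demo_path, rb"alpha")
os.remove(demo_path)
```

Whether either variant is faster for a given workload is something to measure rather than assume; the comments above expect the plain file object to win for bulk line reads.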

2 Answers

In Python 3.x, the return value of an mmap object's .readline() is a bytes object, so you need to check for end of input with:

if line == b"":

Since your code compares against a str ("", which was the return value in Python 2.x), the comparison never succeeds and the loop runs endlessly.

Note that you'll have to convert each line with .decode() to get the equivalent of what f.readline() returns in text mode.
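
Putting the two points together, a readlines()-style helper over mmap might look like the sketch below. The function name `mmap_readlines` and the demo file are hypothetical, it assumes a non-empty UTF-8 file, and unlike text-mode reading it does not translate "\r\n" line endings.

```python
import mmap
import os
import tempfile


def mmap_readlines(path, encoding="utf-8"):
    """Hypothetical helper: return all lines of a non-empty file as a list of str."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # mmap.readline() returns bytes; b"" marks the end of the data,
            # so iter() with a b"" sentinel stops at end of file.
            return [raw.decode(encoding) for raw in iter(m.readline, b"")]


# Demo on a throwaway file whose last line has no trailing newline.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "w", encoding="utf-8", newline="") as out:
    out.write("alpha\nbeta\ngamma")

demo_lines = mmap_readlines(demo_path)
os.remove(demo_path)
```

Each element keeps its trailing newline, as with f.readlines(), and the final line is returned as-is when the file does not end with one.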

sj95126
  • thanks for the pointer. Looks like I have to also `decode` the readline() as string. – Exploring Sep 15 '21 at 03:13
  • Correct. I'll update the answer to include that as it's relevant as an alternative to `f.readline()`. – sj95126 Sep 15 '21 at 03:14
  • Note: You could avoid the version-dependent element by just testing `if not line:` which works equally well for detecting Py2 and Py3 empty `str` or `bytes`, and is faster than testing against a specific sentinel to boot. – ShadowRanger Sep 15 '21 at 03:20

Corrected code:

import mmap

lines = []
with open(path, "rt", encoding="utf-8") as f:
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    while True:
        line = m.readline()  # returns bytes, not str
        if line != b"":
            lines.append(line.decode("utf-8"))
        if line == b"":
            break
    m.close()

Updated code:

lines = []
with open(path, "rb", buffering=0) as f:
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    while True:
        line = m.readline()
        if line:
            lines.append(line.decode("utf-8"))
        else:
            break
    m.close()

Final updated code using Python 3.9 (madvise() requires Python 3.8+):

lines = []
with open(path, "rb", buffering=0) as f:
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    m.madvise(mmap.MADV_SEQUENTIAL)
    while True:
        line = m.readline()
        if line:
            lines.append(line.decode("utf-8"))
        else:
            break
    m.close()
Exploring
  • To avoid a doubled test, change `if line == b"":` to just `else:` (tied to the prior `if`). Also, if you're not using the file object for any purpose other than to open the `mmap`, don't bother opening it for reading as text or with a known encoding (that just adds setup overhead for stuff you'll never use); `with open(path, 'rb') as f:` is simpler and lower overhead (arguably, `with open(path, 'rb', buffering=0) as f:` is best, as it removes another layer of user-mode buffering wrapper that you "pay" for and never benefit from; but it's more verbose/complex; simpler can be better). – ShadowRanger Sep 15 '21 at 03:23
  • @ShadowRanger updated the code. Any other feedback? – Exploring Sep 15 '21 at 03:33
  • @Exploring Another readability improvement is to change `if line != b"":` to `if line:` – eyllanesc Sep 15 '21 at 03:36
  • @Exploring: If you expect to gain any performance from this (and can count on running on 3.8+), I'd suggest putting `m.madvise(mmap.MADV_SEQUENTIAL)` or `m.madvise(mmap.MADV_WILLNEED)` to advise the OS to more aggressively prefetch (so while you're processing lines from the first page, it's fetching the next few, or all of them). On older Python, you might have access to `mmap.MAP_POPULATE` (OS dependent) to pass as a flag when opening the mapping, which is basically a less lazy `MADV_WILLNEED`. I doubt it'll be enough to beat the optimizations for line reading in plain file objects though. – ShadowRanger Sep 15 '21 at 03:47
  • @ShadowRanger learned so much from these comments. Thanks a lot. I will update my Python env to 3.9 and run from there. – Exploring Sep 15 '21 at 03:51
  • @ShadowRanger mmap cannot map an empty file. Is there a nice way to handle it? – Exploring Sep 15 '21 at 03:56
  • @Exploring: Catch the exception when it happens, return empty `list` as file contents? Again, this is another reason to prefer plain file objects; `list(fileobj)` just works, empty or no. – ShadowRanger Sep 15 '21 at 04:44