25

It seems that the mmap interface only supports readline(). If I try to iterate over the object, I get single characters instead of complete lines.

What would be the "pythonic" method of reading a mmap'ed file line by line?

import sys
import mmap
import os


if (len(sys.argv) > 1):
  STAT_FILE=sys.argv[1]
  print STAT_FILE
else:
  print "Need to know <statistics file name path>"
  sys.exit(1)


with open(STAT_FILE, "r") as f:
  map = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
  for line in map:
    print line # RETURNS single characters instead of whole line
martineau
Maxim Veksler
  • 1
    Out of interest, what's the motivation for using a memory-mapped file for this, as opposed to a normal file? – NPE Nov 16 '11 at 13:25
  • 2
    @aix: I could possibly have GB's of raw data, and I would like to access them in the most efficient method possible. But the real reason is: It's cooler :) – Maxim Veksler Nov 16 '11 at 15:33
  • 1
    I don't know whether it's cooler, but you shouldn't simply assume that it's faster (if you really care, you ought to profile). – NPE Nov 16 '11 at 15:36
  • 1
    I added some timings to my post below. – hochl Nov 16 '11 at 16:14

5 Answers

36

The most concise way to iterate over the lines of an mmap is

with open(STAT_FILE, "r+b") as f:
    map_file = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    for line in iter(map_file.readline, b""):
        pass  # process each line here

Note that in Python 3 the sentinel parameter of iter() must be of type bytes, while in Python 2 it needs to be a str (i.e. "" instead of b"").
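As a minimal, self-contained sketch of the two-argument iter() idiom above, assuming Python 3 (so the sentinel is b""): the helper name iter_mmap_lines and the demo file contents are made up for illustration, and access=mmap.ACCESS_READ is used in place of prot=mmap.PROT_READ only because it is accepted on both Unix and Windows.

```python
import mmap
import os
import tempfile


def iter_mmap_lines(path):
    """Yield each line of the file at `path` as bytes, newline included.

    Hypothetical helper for illustration; not part of the mmap module.
    """
    with open(path, "rb") as f:
        # access=mmap.ACCESS_READ works on Unix and Windows;
        # prot=mmap.PROT_READ is Unix-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # readline() returns b"" at the end of the map, which is the
            # sentinel that makes iter() stop.
            for line in iter(m.readline, b""):
                yield line


# Build a small demo file (made-up contents).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"first\nsecond\n\nlast without newline")

lines = list(iter_mmap_lines(demo_path))
os.remove(demo_path)
print(lines)
```

An empty line in the file comes back as b"\n", not b"", so it does not end the loop early. Note that mmap refuses to map an empty file with length 0, so a real script would need to handle that case separately.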

Sven Marnach
  • 3
    I didn't know `iter` took this `callable`/`sentinel` argument pair. +1 and removed my answer in favor of this one. – Fred Foo Nov 16 '11 at 13:37
  • And please change the open mode to `r+b` instead of `r` (as mentioned in my post below). – hochl Nov 16 '11 at 13:59
  • For Windows, use `access=mmap.ACCESS_READ`, see: [Loading file in memory](https://stackoverflow.com/q/13500434/55075). – kenorb Jun 21 '19 at 22:19
  • @SvenMarnach could you please explain why on readline you use the b"" second parameter? Thanks – Gerasimos Ragavanis Aug 29 '19 at 16:43
  • 1
    @GerasimosRagavanis The two-argument version of `iter()` basically means: call the function in the first argument repeatedly and yield the successive return values, but stop once the sentinel in the second argument is returned. So we basically call `map_file.readline()` until it doesn't return any more data. For regular files you could simply write `for line in file`, but `mmap` does not support line iteration directly, so we need to use `iter()`. – Sven Marnach Aug 29 '19 at 19:55
  • 1
    @SvenMarnach How can I get the count of lines in a big file to avoid memory issues using mmap ? – Kar Jul 08 '20 at 11:39
  • @Kar Your question does not have enough context for me to answer it, and it also seems unrelated, so I suggest you ask a new question. For what it's worth, there are numerous questions here on how to count the number of items in an iterator, e.g. this one: https://stackoverflow.com/election. As long as you don't store the whole file in memory at once, there shouldn't be any memory issues, regardless of whether you use `mmap()` or `open()`. – Sven Marnach Jul 08 '20 at 20:36
15

I modified your example like this:

with open(STAT_FILE, "r+b") as f:
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    while True:
        line = m.readline()
        if line == '':
            break
        print line.rstrip()

Suggestions:

Hope this helps.

Edit: I did some timing tests on Linux because the comment made me curious. Here is a comparison of timings made on 5 sequential runs on a 137MB text file.

Normal file access:

real    2.410 2.414 2.428 2.478 2.490
sys     0.052 0.052 0.064 0.080 0.152
user    2.232 2.276 2.292 2.304 2.320

mmap file access:

real    1.885 1.899 1.925 1.940 1.954
sys     0.088 0.108 0.108 0.116 0.120
user    1.696 1.732 1.736 1.744 1.752

Those timings do not include the print statement (I excluded it). Judging by these numbers, I'd say memory-mapped file access is quite a bit faster: roughly 20% less wall-clock time in this test.

Edit 2: Using python -m cProfile test.py I got the following results:

5432833    2.273    0.000    2.273    0.000 {method 'readline' of 'file' objects}
5432833    1.451    0.000    1.451    0.000 {method 'readline' of 'mmap.mmap' objects}

If I'm not mistaken then mmap is quite a bit faster.

Additionally, it seems that `not len(line)` performs worse than `line == ''`; at least that's how I interpret the profiler output.
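For anyone who wants to repeat a comparison of this kind, here is a rough sketch, assuming Python 3. It is not the script used for the numbers above: the function names, the generated test file, and the best-of-three approach are all illustrative choices, and the results will depend heavily on the machine, the file size, and the page cache.

```python
import mmap
import os
import tempfile
import time


def count_lines_file(path):
    """Read the file line by line with ordinary buffered I/O."""
    count = 0
    with open(path, "rb") as f:
        for _ in iter(f.readline, b""):
            count += 1
    return count


def count_lines_mmap(path):
    """Read the same file line by line through a read-only memory map."""
    count = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for _ in iter(m.readline, b""):
                count += 1
    return count


def best_of(func, path, repeats=3):
    """Return (result, fastest wall-clock time in seconds) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(path)
        best = min(best, time.perf_counter() - start)
    return result, best


# Generate a throwaway test file (made-up contents, far smaller than 137MB).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"some statistics line\n" * 200000)

n_file, t_file = best_of(count_lines_file, demo_path)
n_mmap, t_mmap = best_of(count_lines_mmap, demo_path)
os.remove(demo_path)

print("file: %d lines in %.3fs" % (n_file, t_file))
print("mmap: %d lines in %.3fs" % (n_mmap, t_mmap))
```

Both functions call readline() explicitly so that the comparison matches the profiler lines above; iterating a regular file with a plain for loop may behave differently, so it is worth measuring that variant too before drawing conclusions.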

martineau
hochl
  • `AttributeError: 'mmap.mmap' object has no attribute 'readlines'` – Fred Foo Nov 16 '11 at 12:33
  • 1
    hochl: Thank you. The benchmarks are great. Could you attach a script to reproduce the test and confirm the analysis? – Maxim Veksler Nov 16 '11 at 16:33
  • 2
    I simply commented out the print in your program and then did `time test.py` like 10 times, then took the 5 middle values. It would be interesting to check the results of `python -m cProfile test.py`. – hochl Nov 16 '11 at 16:51
1

The following is reasonably concise:

with open(STAT_FILE, "r") as f:
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    while True:
        line = m.readline()  
        if line == "": break
        print line
    m.close()

Note that line retains the newline, so you might like to remove it. It is also the reason why if line == "" does the right thing (an empty line is returned as "\n").

The reason the original iteration works the way it does is that mmap tries to look like both a file and a string. It looks like a string for the purposes of iteration.
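The dual behaviour can be seen in a short sketch, assuming Python 3 and a small made-up demo file: iterating the map yields one item per byte, while readline() walks it line by line and slicing works as on a bytes object.

```python
import mmap
import os
import tempfile

data = b"ab\ncd\n"  # made-up demo contents
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(data)

with open(demo_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # String-like side: iteration goes item by item, one per byte.
        items = list(m)
        # File-like side: readline() uses the map's own position.
        m.seek(0)
        lines = list(iter(m.readline, b""))
        # String-like side again: slicing returns bytes.
        first_slice = m[:2]
os.remove(demo_path)

print(len(items), len(lines), first_slice)
```

Here the plain iteration produces six items for a six-byte file, whereas the readline() loop produces two lines, which is the mismatch the question ran into.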

I have no idea why it can't (or chooses not to) provide readlines()/xreadlines().

NPE
  • The `readlines()` method of file objects returns a list of all lines of the file. doing this on an mmapped file would completely defeat the purpose of the mmap. – Sven Marnach Nov 16 '11 at 13:04
  • @SvenMarnach: It could be a generator. In any case, to be totally honest I fail to see the need for memory-mapped files in this entire question. – NPE Nov 16 '11 at 13:28
  • You are completely right. So maybe the reason for the non-existence of such a generator is that it would be pointless. :) – Sven Marnach Nov 16 '11 at 13:32
0

Python 2.7 32bit on Windows is more than twice as fast on an mmapped file:

On a 27MB, 509k-line text file (my 'parse' function is not interesting; it mostly just calls readline() very rapidly):

with open(someFile,"r") as f:
    if usemmap:
        m=mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    else:
        m=f
    e.parse(m)  # runs for both the mmap and the plain-file case

With MMAP:

read in 0.308000087738

Without MMAP:

read in 0.680999994278
Michael
-1

Even better, in case you get an error with mmap(), iterate over the file object directly:

with open('/content/drive/MyDrive......', "r+b") as f:
    # map_file = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # mmap was not recognized here (possibly a missing import)
    for line in iter(f.readline, b""):
        print(line)
Destroy666
  • Answer needs supporting information. Your answer could be improved with additional supporting information. Please [edit](https://stackoverflow.com/posts/76319805/edit) to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](https://stackoverflow.com/help/how-to-answer). – moken May 27 '23 at 03:50