I have a rather naive question about the performance of reading from a file in Python. I am implementing an application that needs to read from a binary file. The data is organised in blocks (events, all occupying the same space and encoding the same type of information), and for any given event it might not be necessary to read all of the information. I have written a function that memory-maps the file and maintains a file "pointer", which is used to seek() and read() from the file.

import mmap

self.f = open("myFile", "rb")
# length=0 maps the whole file; ACCESS_READ makes the mapping read-only
self._mmf = mmap.mmap(self.f.fileno(), length=0, access=mmap.ACCESS_READ)

Since I know which bytes correspond to which piece of information in the event, I was thinking of implementing a function that takes the type of information needed from the event, calls seek() to move to that position, and read()s only the relevant bytes. It would then reposition the file pointer at the beginning of the event for any subsequent calls (which might or might not be needed).
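
A minimal sketch of the per-field access described above, decoding directly from the mapping by offset so that no file pointer has to be moved or restored. The event layout (`id`, `energy`, `channel`) and the helper names `read_field`, `read_event`, and `demo` are hypothetical, chosen only for illustration; the real format will differ.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical event layout (an assumption, not from the original post):
# uint32 id, float64 energy, uint16 channel; little-endian, no padding.
EVENT = struct.Struct("<IdH")  # 14 bytes per event
FIELDS = {
    "id": (0, struct.Struct("<I")),
    "energy": (4, struct.Struct("<d")),
    "channel": (12, struct.Struct("<H")),
}


def read_field(mm, event_index, name):
    """Decode one field at its byte offset; no seek() or read() involved."""
    offset, fmt = FIELDS[name]
    return fmt.unpack_from(mm, event_index * EVENT.size + offset)[0]


def read_event(mm, event_index):
    """Decode the whole event in one call, for comparison."""
    return EVENT.unpack_from(mm, event_index * EVENT.size)


def demo():
    # Write three sample events to a temporary file, then map it read-only.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for i in range(3):
            tmp.write(EVENT.pack(i, i * 1.5, i + 10))
        path = tmp.name
    try:
        with open(path, "rb") as f, mmap.mmap(
            f.fileno(), length=0, access=mmap.ACCESS_READ
        ) as mm:
            return read_field(mm, 2, "energy"), read_event(mm, 1)
    finally:
        os.remove(path)
```

Because `struct.unpack_from` takes an explicit offset, each call is independent of any previous one, so there is no position to reset between fields. Which variant is faster for a given event size is best settled by timing both on real data.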

My question is: is this implementation necessarily expected to be slower than reading the entire event once (and possibly using only part of it), or does it depend on the event size compared to how many calls to seek() I might have in the workflow?

Thanks!

FredS
  • The actual low-level OS calls are generally a block at a time. Doing several smaller reads of chunks smaller than a block is going to add more back and forth between userspace and kernel space as opposed to just reading an (aligned) block at a time. – Charles Duffy Dec 12 '21 at 03:41
  • On the other hand, if you care so much about performance that you need to know that, Python is probably the wrong language for your project in the first place. – Charles Duffy Dec 12 '21 at 03:42
  • That said, mmap will page in a block at a time regardless; when you use mmap, you _aren't_ reading smaller-than-a-block pieces from disk at all, because everything important happens with page-level granularity. – Charles Duffy Dec 12 '21 at 03:43
  • By the way, are you familiar with `pread()` and related syscalls? A modern kernel can combine seek and read already. – Charles Duffy Dec 12 '21 at 03:45
  • To put what I was saying above differently: with mmap, there aren't any `read()` or `seek()` syscalls at all, so talking about their performance makes no sense. You just get a bunch of virtual memory pages where the page-fault handler copies in data from the linked file. – Charles Duffy Dec 13 '21 at 13:26
  • Thank you! This is mostly the answer I was looking for. – FredS Dec 14 '21 at 22:55
  • BTW, another note about using mmap in read-only mode: if the system is running low on RAM, the kernel doesn't need to copy an mmap'd file's pages to swap as it might for normal memory contents; it can just discard them outright. If they're needed again later, the page-fault handler gets called again and copies the data back in. – Charles Duffy Dec 15 '21 at 01:11
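
For the plain-file-descriptor route raised in the comments, the combined seek-and-read call is exposed on Unix as `pread()`, which Python wraps as `os.pread`. The sketch below is only an illustration under that assumption (it is not available on Windows); the helper names `read_at` and `demo` and the sample bytes are made up.

```python
import os
import tempfile


def read_at(fd, offset, size):
    """Read `size` bytes at `offset` without moving the file position."""
    data = os.pread(fd, size, offset)
    if len(data) != size:
        raise EOFError(f"wanted {size} bytes at offset {offset}, got {len(data)}")
    return data


def demo():
    # Create a small sample file to read from.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789ABCDEF")
        path = tmp.name
    fd = os.open(path, os.O_RDONLY)
    try:
        first = read_at(fd, 10, 4)
        second = read_at(fd, 2, 3)
        # The descriptor's own position was never touched by pread().
        position = os.lseek(fd, 0, os.SEEK_CUR)
        return first, second, position
    finally:
        os.close(fd)
        os.remove(path)
```

Each `os.pread` call is still one syscall per field, so for many small reads within the same event the mmap approach, or a single read of the whole event, may well involve less overhead.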

0 Answers