I'm trying to implement a strings(1)-like function in Python.
import re

def strings(f, n=5):
    # TODO: support files larger than available RAM
    return re.finditer(br'[!-~\s]{%i,}' % n, f.read())

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'rb') as f:
        for m in strings(f):
            print(m[0].decode().replace('\x0A', '\u240A'))
Setting aside the case of actual matches* that are larger than the available RAM, the above code fails in the case of files that are merely, themselves, larger than the available RAM, because f.read() slurps the entire file into memory at once!
An attempt to naively "iterate over f" will be done linewise, even for binary files; this may be inappropriate because (a) it may return different results than just running the regex on the whole input, and (b) if the machine has 4 gigabytes of RAM and the file contains any match for rb'[^\n]{8589934592,}', then that unasked-for match will cause a memory problem anyway!
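To make point (a) concrete, here is a minimal illustration (using a made-up 10-byte buffer in place of a file): because the character class includes \s, a match may span a newline, so a linewise scan can miss matches that a whole-input scan finds.

```python
import re

pattern = re.compile(br'[!-~\s]{5,}')
data = b'\x00abc\ndef\x00'  # the printable run spans the newline

# Scanning the whole buffer finds one 7-byte match...
whole = [m[0] for m in pattern.finditer(data)]
print(whole)  # [b'abc\ndef']

# ...but a linewise scan (which is what naive iteration over a binary
# file amounts to) sees only two short runs, neither long enough:
linewise = [m[0]
            for line in data.split(b'\n')
            for m in pattern.finditer(line)]
print(linewise)  # []
```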
Does Python's regex library enable any simple way to stream re.finditer over a binary file?
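For context, the closest I can get without library support is a hand-rolled chunked scan (a sketch; stream_strings and chunk_size are names I made up). It exploits the fact that, for this particular pattern, a match is just a maximal run of in-class bytes of length >= n, so any run touching the end of the buffer can be carried over into the next chunk. That property does not hold for regexes in general, hence the question.

```python
import re

def stream_strings(f, n=5, chunk_size=1 << 16):
    # Sketch of a chunked scan.  It relies on a property of THIS
    # pattern (a match is a maximal run of class bytes, length >= n),
    # so it is not a general way to stream re.finditer.
    run = re.compile(br'[!-~\s]+')   # maximal runs of "stringy" bytes
    buf = b''
    while True:
        chunk = f.read(chunk_size)
        buf += chunk
        tail_start = len(buf)        # default: discard the whole buffer
        for m in run.finditer(buf):
            if chunk and m.end() == len(buf):
                tail_start = m.start()   # run may continue in next chunk
                break
            if m.end() - m.start() >= n:
                yield m[0]
        buf = buf[tail_start:]
        if not chunk:                # EOF: every run has been decided
            return
```

Memory stays bounded by chunk_size plus the longest run, so a single oversized match can still exhaust RAM -- exactly the caveat already flagged above.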
*I am aware that it is possible to write regular expressions that may require an exponential amount of CPU or RAM relative to their input length. Handling these cases is, obviously, out of scope; I'm assuming for the purposes of the question that the machine at least has enough resources to handle the regex, its largest match on the input, the acquisition of this match, and the ignoring of all nonmatches.
Not a duplicate of Regular expression parsing a binary file? -- that question is actually asking about bytes-like objects; I am asking about binary files per se.
Not a duplicate of Parse a binary file with Regular Expressions? -- for the same reason.
Not a duplicate of Regular expression for binary files -- that question only addressed the special case where offsets of all matches were known beforehand.
Not a duplicate of Regular expression for binary files -- combination of both of these reasons.