I'm trying to implement a strings(1)-like function in Python.
import re

def strings(f, n=5):
    # TODO: support files larger than available RAM
    return re.finditer(br'[!-~\s]{%i,}' % n, f.read())

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'rb') as f:
        for m in strings(f):
            print(m[0].decode().replace('\x0A', '\u240A'))
Setting aside the case of actual matches* that are larger than the available RAM, the above code fails in the case of files that are merely, themselves, larger than the available RAM, because f.read() slurps the entire file into memory at once!
An attempt to naively "iterate over f" will be done linewise, even for binary files; this may be inappropriate because (a) it may return different results than just running the regex on the whole input, and (b) if the machine has 4 gigabytes of RAM and the file contains any match for rb'[^\n]{8589934592,}', then that unasked-for match will cause a memory problem anyway!
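To make point (a) concrete, here is a minimal illustration (using a made-up 10-byte buffer in place of a file): because the character class includes \s, a match may span a newline, so a linewise scan can miss matches that a whole-input scan finds.

```python
import re

pattern = re.compile(br'[!-~\s]{5,}')
data = b'\x00abc\ndef\x00'  # the printable run spans the newline

# Scanning the whole buffer finds one 7-byte match...
whole = [m[0] for m in pattern.finditer(data)]
print(whole)  # [b'abc\ndef']

# ...but a linewise scan (which is what naive iteration over a binary
# file amounts to) sees only two short runs, neither long enough:
linewise = [m[0]
            for line in data.split(b'\n')
            for m in pattern.finditer(line)]
print(linewise)  # []
```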
Does Python's regex library enable any simple way to stream re.finditer over a binary file?
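For context, the closest I can get without library support is a hand-rolled chunked scan (a sketch; stream_strings and chunk_size are names I made up). It exploits the fact that, for this particular pattern, a match is just a maximal run of in-class bytes of length >= n, so any run touching the end of the buffer can be carried over into the next chunk. That property does not hold for regexes in general, hence the question.

```python
import re

def stream_strings(f, n=5, chunk_size=1 << 16):
    # Sketch of a chunked scan.  It relies on a property of THIS
    # pattern (a match is a maximal run of class bytes, length >= n),
    # so it is not a general way to stream re.finditer.
    run = re.compile(br'[!-~\s]+')   # maximal runs of "stringy" bytes
    buf = b''
    while True:
        chunk = f.read(chunk_size)
        buf += chunk
        tail_start = len(buf)        # default: discard the whole buffer
        for m in run.finditer(buf):
            if chunk and m.end() == len(buf):
                tail_start = m.start()   # run may continue in next chunk
                break
            if m.end() - m.start() >= n:
                yield m[0]
        buf = buf[tail_start:]
        if not chunk:                # EOF: every run has been decided
            return
```

Memory stays bounded by chunk_size plus the longest run, so a single oversized match can still exhaust RAM -- exactly the caveat already flagged above.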
*I am aware that it is possible to write regular expressions that may require an exponential amount of CPU or RAM relative to their input length. Handling these cases is, obviously, out of scope; I'm assuming for the purposes of the question that the machine at least has enough resources to handle the regex, its largest match on the input, the acquisition of this match, and the ignoring of all nonmatches.
Not a duplicate of Regular expression parsing a binary file? -- that question is actually asking about bytes-like objects; I am asking about binary files per se.
Not a duplicate of Parse a binary file with Regular Expressions? -- for the same reason.
Not a duplicate of Regular expression for binary files -- that question only addressed the special case where offsets of all matches were known beforehand.
Not a duplicate of Regular expression for binary files -- combination of both of these reasons.