
I would like to use the re module with streams, but not necessarily file streams, at minimal development cost.

For file streams, there's the mmap module, which is able to impersonate a string and as such can be used freely with re.

Now I wonder how mmap manages to craft an object that re can use. If I pass an arbitrary object, re protects itself against incompatible ones with TypeError: expected string or bytes-like object. So I thought I'd create a class that derives from str or bytes, override a few methods such as `__getitem__` (this intuitively fits Python's duck-typing philosophy), and make them interact with my original stream. However, this doesn't seem to work at all: my overrides are completely ignored.

Is it possible to create such a "lazy" string in pure Python, without C extensions? If so, how?

A bit of background to disregard alternative solutions:

  • Can't use mmap (the stream contents are not a file)
  • Can't dump the whole thing to the HDD (too slow)
  • Can't load the whole thing to the memory (too large)
  • Can seek, know the size and compute the content at runtime

Example code that demonstrates bytes' resistance to modification:

import re

class FancyWrapper(bytes):
    def __init__(self, base_str):
        pass  # super() isn't called, and yet the code below finds abc, aaa and bbb

print(re.findall(b'[abc]{3}', FancyWrapper(b'abc aaa bbb def')))
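To make the failure mode concrete, here's a small sketch (the overrides below are mine, not from the code above) showing that the matching engine bypasses Python-level methods entirely: it reads the object's underlying C buffer, so even a `__getitem__` that raises is never called during matching:

```python
import re

class FancyWrapper(bytes):
    # Hypothetical overrides: neither is consulted by re during matching.
    def __getitem__(self, index):
        raise RuntimeError("never called by the match engine")

    def __len__(self):
        return 0  # lie about the length

w = FancyWrapper(b'abc aaa bbb def')
print(len(w))                       # 0 -- our override is used by len()
print(re.search(b'bbb', w).span())  # (8, 11) -- re scanned all 15 raw bytes
```

So the engine neither indexes the object nor asks it for its length; it works directly on the buffer that bytes allocated at construction time.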
rr-
  • What *are* your streams? And can you post the code you tried that derives from `str`? – bbayles Feb 28 '16 at 14:48
  • @bbayles my streams contain "data ranges" - each "data range" may take data from the memory, or from a file on the HDD. There's a function that combines information from such range offsets into one linear memory when asked for data at specific offset. Basically it's an approach to handle editing huge files. Edited the post to provide the most basic example. – rr- Feb 28 '16 at 15:31
  • Maybe I'm thick, but please do elaborate why you cannot simply iterate over your stream? With any file we'd do a `for line in fh: ... re.search(line, pattern) ...`. For other things than Files use simple code patterns like [this one for string streams](http://stackoverflow.com/questions/21843693/creating-stream-to-iterate-over-from-string-in-python). This should be easily possible if you can seek in your data. – cfi Feb 28 '16 at 18:57
  • I don't iterate over stream this way because it contains binary data so it's difficult to choose boundary such as `\n`. Even for plain text files, you'd want to be able to do multiline searches as well. Basically, choosing any artificial boundary affects the regex behavior and leads to match inconsistencies. Example: user tries to find `\x00{20000,}` which is well present in the stream, but the match is never shown because the "iterated page size" (for lack of better terms) was too small to hold everything at once. This makes the user believe the stream never contained such sequence. – rr- Feb 28 '16 at 20:26

1 Answer


Well, I found out that it's not possible, not currently.

  1. Python's re module internally operates on strings by scanning through a plain C buffer, which requires the object it receives to satisfy these properties:

    • Its representation must reside in system memory,
    • Its representation must be linear, i.e. it cannot contain gaps of any sort,
    • Its representation must contain, as a whole, the content we're searching in.

    So even if we managed to make re work with something other than bytes or str, we'd have to mimic mmap-like behavior, i.e. present our content provider as a linear region in system memory.
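This requirement can be checked from the other direction: re accepts any object that exposes a linear buffer (bytearray, memoryview, mmap) and rejects anything that doesn't. A quick stdlib-only check:

```python
import re

data = bytearray(b'abc aaa bbb def')       # not bytes, but buffer-backed
print(re.findall(b'[abc]{3}', data))       # [bytearray(b'abc'), ...]
print(re.search(b'bbb', memoryview(data)).span())  # (8, 11)

try:
    re.search(b'bbb', iter(data))          # no linear buffer -> rejected
except TypeError as err:
    print(err)  # "expected string or bytes-like object, ..."
```

In other words, the deciding factor isn't the type at all, but whether the object supports the buffer protocol, which a pure-Python class cannot implement.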

  2. But the mmap mechanism works only for files, and even then it is pretty limited. For example, one can't mmap a large file for writing, as per this answer.

  3. Even the regex module, which contains many super-duper additions such as (?r), doesn't accommodate content sources other than str and bytes.

For completeness: does this mean we're screwed and can't scan through large, dynamic content with re? Not necessarily. There's a way to do it if we accept a limit on the maximum match size. The solution is inspired by cfi's comment, and extends it to binary files.

  1. Let n = max match size.
  2. Start a search at position x
  3. While there's content:
    1. Navigate to position x
    2. Read 2*n bytes to scan buffer
    3. Find the first match within scan buffer
    4. If match was found:
      1. Let x = x + match_pos + match_size
      2. Notify about the match_pos and match_size
    5. If match wasn't found:
      1. Let x = x + n

What do we accomplish by using a buffer twice as big as the max match size? Imagine the user searches for B{3} and the max match size is set to 3. If we read just max-match-size bytes into the scan buffer and the data at the current x contained AABBBA:

  1. This iteration would look at AAB. No match.
  2. The next iteration would move the pointer to x+3.
  3. Now the scan buffer would contain BBA. Still no match.

The BBB straddling the two windows is never seen, so the match is silently lost. This is obviously bad, and the simple solution is to read twice as many bytes as we jump over, which ensures the anomaly near the scan buffer's tail is resolved.
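The boundary effect is easy to reproduce in isolation (toy data of my own: b'BBB' straddles two adjacent 3-byte windows of b'AABBBA'):

```python
import re

data = b'AABBBA'
n = 3                      # window size equal to the max match size
windows = [data[i:i + n] for i in range(0, len(data), n)]
print(windows)             # [b'AAB', b'BBA'] -- neither contains BBB
print(any(re.search(b'B{3}', w) for w in windows))  # False: match lost
# A window of 2*n covers the straddling match:
print(re.search(b'B{3}', data[:2 * n]).span())      # (2, 5)
```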

Note that the short-circuiting on the first match within the scan buffer is supposed to protect against other anomalies such as buffer underscans. It could probably be tweaked to minimize reads for scan buffers that contain multiple matches, but I wanted to avoid further complicating things.

This probably isn't the most performant algorithm out there, but it's good enough for my use case, so I'm leaving it here.

rr-