0

I need to read a file in 64KB chunks in a loop and process them, but stop 16 bytes before the end of the file: the last 16 bytes are tag metadata.

The file might be very large, so I can't read it all into RAM.

All the solutions I find are a bit clumsy and/or unpythonic.

with open('myfile', 'rb') as f:
    while True:
        block = f.read(65536)
        if not block:
            break
        process_block(block)

If 16 <= len(block) < 65536, it's easy: it's the last block ever. So useful_data = block[:-16] and tag = block[-16:]

If len(block) == 65536, it could mean three things: the full block is useful data; or this 64KB block is in fact the last block, so useful_data = block[:-16] and tag = block[-16:]; or this 64KB block is followed by another block of only a few bytes (say 3 bytes), in which case useful_data = block[:-13] and tag = block[-13:] + last_block[:3].

How can I deal with this problem in a nicer way than distinguishing all these cases?

Note:

  • the solution should work for a file opened with open(...), but also for an io.BytesIO() object, or for a remote file opened over SFTP (with pysftp).

  • I was thinking about getting the file object size, with

    f.seek(0,2)
    length = f.tell()
    f.seek(0)
    

    Then after each

    block = f.read(65536)
    

    we can know if we are far from the end with length - f.tell(), but again the full solution does not look very elegant.
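The size measurement described in this note can be wrapped in a small helper. This is a minimal sketch, assuming the file object is seekable; `stream_size` is a hypothetical name chosen for illustration, not part of any library.

```python
import io
import os


def stream_size(f):
    """Return the total size in bytes of a seekable binary file object.

    The current position is saved and restored, so the helper can be
    called at any point during reading.
    """
    pos = f.tell()
    f.seek(0, os.SEEK_END)  # same as f.seek(0, 2)
    size = f.tell()
    f.seek(pos)
    return size


# Example with an in-memory stream: 100 bytes total, 10 already consumed.
f = io.BytesIO(b"x" * 100)
f.read(10)
bytes_before_tag = stream_size(f) - f.tell() - 16
```

Calling `f.tell()` after the seek, rather than relying on the return value of `f.seek()`, avoids assuming that every file-like object returns the new position from `seek`.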

Basj

2 Answers

1

In every iteration you can just read min(65536, L-f.tell()-16) bytes, where L is the total file size.

Something like this:

from pathlib import Path

L = Path('myfile').stat().st_size  # total file size in bytes

with open('myfile', 'rb') as f:
    while True:
        # never read into the last 16 bytes (the tag)
        to_read_length = min(65536, L - f.tell() - 16)
        block = f.read(to_read_length)
        process_block(block)
        if f.tell() == L - 16:
            break

I did not run this, but I hope you get the gist of it.
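The same idea can be written against any seekable file object instead of a filesystem path. The following is a sketch under stated assumptions: `process_all_but_tag` and its parameters are hypothetical names, the stream is assumed to support `seek` and `tell`, and the payload is handed to a caller-supplied callback.

```python
import io
import os


def process_all_but_tag(f, process_block, block_size=65536, tag_size=16):
    """Feed everything except the trailing tag to process_block; return the tag."""
    # Measure the stream without relying on the filesystem.
    f.seek(0, os.SEEK_END)
    length = f.tell()
    f.seek(0)
    if length < tag_size:
        raise ValueError("stream is shorter than the tag")

    remaining = length - tag_size  # payload bytes still to be read
    while remaining > 0:
        block = f.read(min(block_size, remaining))
        if not block:
            raise EOFError("stream ended before the expected payload size")
        process_block(block)
        remaining -= len(block)  # tolerate reads shorter than requested

    return f.read(tag_size)


# Usage with an in-memory stream and a small block size.
data = bytes(range(256)) * 100 + b"T" * 16
chunks = []
tag = process_all_but_tag(io.BytesIO(data), chunks.append, block_size=4096)
```

Tracking `remaining` rather than comparing `f.tell()` on each pass keeps the loop correct when a read returns fewer bytes than requested, and it never calls the callback with an empty block.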

Lior Cohen
  • Nice solution! I'd replace `st_size` with `f.seek(0, 2); L = f.tell(); f.seek(0)` so that it works for any file object, not only for filesystem files. – Basj Nov 22 '20 at 20:34
  • I'm hesitating about writing the `tag` at the beginning of the file (thus requiring a few `f.seek()` because the tag is only computed when the rest of the file is written) or keep it at the end + use this solution. What would you do in [this situation](https://stackoverflow.com/questions/64951915/encrypt-a-big-file-that-does-not-fit-in-ram-with-aes-gcm/64959223#64959223) @LiorCohen? – Basj Nov 22 '20 at 20:36
  • If you know in advance how much room you need for your tag, I would put the tag at the beginning, and once I know it, seek back to 0 and write it. Just be careful not to overwrite the first block. – Lior Cohen Nov 22 '20 at 20:52
1

The following method relies only on the fact that the f.read() method returns an empty bytes object upon end of stream (EOS). It could thus be adapted for sockets simply by replacing f.read() with s.recv().

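import random  # only needed for the test-sized reads below

def read_all_but_last16(f):
    rand = random.Random()  # just for testing
    buf = b''
    while True:
        bytes_read = f.read(rand.randint(1, 40))  # just for testing
        # bytes_read = f.read(65536)
        buf += bytes_read
        if not bytes_read:
            break
        process_block(buf[:-16])  # may be empty while buf holds 16 bytes or fewer
        buf = buf[-16:]  # always hold back the last 16 bytes
    verify(buf[-16:])  # at end of stream, what is left is the tag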

It works by always leaving 16 bytes at the end of buf until EOS, then finally processing the last 16. Note that if there aren't at least 17 bytes in buf then buf[:-16] returns the empty bytes object.
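A variation on the same hold-back idea, sketched here with hypothetical names (`process_holding_back_tag`, `read`, `process_block`): it takes any read-like callable, skips the callback while fewer than 17 bytes are buffered, and checks that a full tag was actually received. It assumes `tag_size` is at least 1.

```python
import io


def process_holding_back_tag(read, process_block, block_size=65536, tag_size=16):
    """Process a stream of unknown length, returning its last tag_size bytes.

    `read` is any callable that takes a size and returns bytes, with an
    empty bytes object signalling end of stream (for example f.read).
    """
    held = b""
    while True:
        chunk = read(block_size)
        if not chunk:
            break  # end of stream
        held += chunk
        if len(held) > tag_size:
            # Everything except the last tag_size bytes is certainly payload.
            process_block(held[:-tag_size])
            held = held[-tag_size:]
    if len(held) < tag_size:
        raise ValueError("stream is shorter than the tag")
    return held


# Usage: a deliberately awkward block size to exercise the boundaries.
data = b"payload bytes " * 10 + b"0123456789abcdef"
chunks = []
tag = process_holding_back_tag(io.BytesIO(data).read, chunks.append, block_size=7)
```

Because this version never seeks or asks for the stream length, it should also suit sources that cannot report their size, at the cost of one extra copy of up to 16 bytes per read.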

President James K. Polk