How to efficiently read part of a huge tiff image?

Question

I have a collection of rather large tiff files (typical resolution is 30k x 30k to 50k x 50k). These have some interesting properties: 1. they contain several different resolutions of the same image, and 2. the data seems to be compressed as JPEG tiles (512x512). They also seem to be BigTIFFs.

I'd like to read specific regions of interest (more or less random access pattern) and I'd like it to be as fast as possible (~about as fast as decompressing the needed tiles plus a little bit of overhead). Decompressing the whole file, or even one of the resolution layers, is not practical. I'm using python.

I tried opening with skimage.io.MultiImage('test.tiff'))[level], and this works (produces correct data). It does decompress each resolution layer completely, although not the whole file; so okay at the low resolutions but not really useful for the highest resolutions. There didn't seem to be any way to select a region of interest in skimage.io, or in any of the libraries it uses (imread, PIL, ...).

I also tried using OpenSlide using img.read_region((x,y), level, (width, height)). This library seems made exactly for this type of data, and is very fast, but unfortunately produces incorrect data for some regions. Until the bug is fixed upstream, I can't use it.

Lastly, using a very recent version of tifffile=2020.6.3 and imagecodecs=2020.5.30 (older versions don't work - I think at least 2018.10.18 is needed), I could list the tiles, using code modified from the tifffile documentation:

with tifffile.TiffFile('test.tiff') as tif:
    fh = tif.filehandle
    for page in tif.pages:
        jpegtables = page.tags.get('JPEGTables', None)
        if jpegtables is not None:
            jpegtables = jpegtables.value
        for index, (offset, bytecount) in enumerate(
            zip(page.dataoffsets, page.databytecounts)
        ):
            fh.seek(offset)
            data = fh.read(bytecount)
            tile, indices, shape = page.decode(data, index, jpegtables)
            print(tile.shape, indices, shape)

It seems the page.decode() call actually decompresses each tile (tile is a numpy array with pixel data). It is not obvious how to only get the index but not decompress. I'm also not sure how fast this would be. This leaves any selection of a region of interest and reassembly of tiles as an exercise to the user.

How do I efficiently read regions of interest out of files like this? Does someone have example code to do that with tifffile? Or, is there another library that would do the trick?

Try https://gist.github.com/rfezzani/b4b8852c5a48a901c1e94e09feb34743 — cgohlke, Jul 06 '20 at 11:23

How to efficiently read part of a huge tiff image?

0 Answers0