
I'm storing .tiff files on Google Cloud Storage. I'd like to manipulate them using a distributed Dask cluster installed with Helm on Kubernetes.

Based on the dask-image repo, the Dask documentation on remote data services, and the use of storage_options, it currently looks like remote reads are supported for the .zarr, .tdb, .orc, .txt, .parquet, and .csv formats. Is that correct? If so, is there a recommended workaround for accessing remote .tiff files?

skeller88

2 Answers


There are many ways to do this. I would probably use a library like skimage.io.imread along with dask.delayed to read the TIFF files in parallel and then arrange them into a Dask Array.
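
For example, here is a minimal sketch of that pattern; the file names are placeholders, and it assumes every image shares the same shape and dtype:

import dask
import dask.array as da
from skimage.io import imread

# Placeholder paths or public URLs for the TIFF files
filenames = [f"image_{i:03d}.tif" for i in range(10)]

# Read one image eagerly to learn the shape and dtype of the rest
sample = imread(filenames[0])

# Wrap each read in dask.delayed so it runs lazily on the cluster,
# then promote each lazy read to a chunk of a Dask Array
lazy_reads = [dask.delayed(imread)(fn) for fn in filenames]
arrays = [da.from_delayed(lr, shape=sample.shape, dtype=sample.dtype)
          for lr in lazy_reads]

# Stack the per-file arrays along a new leading axis
stack = da.stack(arrays, axis=0)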

I encourage you to take a look at this blogpost on loading image data with Dask, which does something similar.

I believe that the skimage.io.imread function will happily read data from a URL, although it may not know how to interoperate with GCS. If the data on GCS is also available at a public URL (easy to arrange if you have access to the GCS bucket), then that approach works directly. Otherwise you might use the gcsfs library to get the bytes from the file and then feed those bytes into some Python image reader.
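
For instance, a sketch of that gcsfs route, reusing the dask.delayed pattern above with imageio as the image reader (the project and bucket names are placeholders, and uniform shape/dtype is again assumed):

import dask
import dask.array as da
import gcsfs
import imageio

fs = gcsfs.GCSFileSystem(project="project_name")
paths = fs.glob("bucket/*.tif")  # placeholder bucket

def read_tiff(path):
    # Fetch the raw bytes from GCS and decode them with imageio
    return imageio.imread(fs.cat(path), "TIFF")

# One eager read to learn shape and dtype, then lazy reads for the rest
sample = read_tiff(paths[0])
arrays = [da.from_delayed(dask.delayed(read_tiff)(p),
                          shape=sample.shape, dtype=sample.dtype)
          for p in paths]
stack = da.stack(arrays, axis=0)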

MRocklin

Building off @MRocklin's answer, I found two ways to do this with gcsfs. One uses imageio for image parsing:

import gcsfs
import imageio

fs = gcsfs.GCSFileSystem(project="project_name")
img_bytes = fs.cat("bucket/blob_name.tif")  # raw TIFF bytes from GCS
img = imageio.core.asarray(imageio.imread(img_bytes, "TIFF"))

The other uses opencv-python, downloading the file locally before parsing it:

import cv2
import gcsfs

fs = gcsfs.GCSFileSystem(project="project_name")
fs.get("bucket/blob_name.tif", "local.tif")  # download the blob to a local file
img = cv2.imread("local.tif", cv2.IMREAD_UNCHANGED)  # already a NumPy ndarray
skeller88