
I have three GeoTIFFs, each roughly 500 MB in size, on AWS S3, which I am trying to process on an EMR cluster using Dask, but I obtain a MemoryError after processing the first TIFF.

After reading each GeoTIFF using xarray.open_rasterio(), I convert the grid values to boolean and then multiply the array by a floating-point value. This workflow has executed successfully on three GeoTIFFs of 50 MB each. I have also tried chunking when reading with xarray, but obtained the same result. A sketch of the workflow is below.
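For reference, here is a minimal sketch of that workflow; the S3 path, scale factor, and chunk sizes are hypothetical placeholders, and reading directly from S3 assumes rasterio/GDAL is configured with AWS credentials.

    import xarray as xr

    # Hypothetical S3 path and scale factor -- adjust to your own data.
    url = "s3://my-bucket/raster.tif"

    # Passing chunks makes xarray wrap the data in a Dask array, so nothing
    # is loaded into memory until compute time.
    da = xr.open_rasterio(url, chunks={"band": 1, "x": 2048, "y": 2048})

    # Convert grid values to boolean, then scale by a float.
    # Both operations stay lazy and are applied chunk by chunk.
    result = (da > 0) * 0.25

    # Trigger the computation (or write out the result instead).
    result = result.compute()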

Is there a size limitation with Dask or another possible issue I could be running into?

jwx
  • Don't know about the EMR specifications, but on a normal machine this file size and operation should not be a problem whatsoever, normal or chunked. Can you post your code? – Christoph Rieke Jul 07 '19 at 17:33

1 Answer


Is there a size limitation with Dask or another possible issue I could be running into?

Dask itself does not artificially impose any size limitations; it runs as ordinary Python processes. I recommend looking into normal Python or hardware issues instead. My first guess would be that you are using very small VMs, but that's just a guess. Good luck!
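If you are using dask.distributed on the EMR cluster, one quick check is to inspect each worker's memory limit to see whether the workers are simply too small for 500 MB rasters plus intermediates. A minimal sketch, assuming a reachable scheduler (the address below is a placeholder):

    from dask.distributed import Client

    # Hypothetical scheduler address on the EMR cluster.
    client = Client("tcp://scheduler-address:8786")

    # Print each worker's configured memory limit.
    info = client.scheduler_info()
    for addr, worker in info["workers"].items():
        print(addr, worker["memory_limit"])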

MRocklin