
Our organisation has recently been using Databricks for ETL and dataset development. However, I have found the libraries and capabilities for raster datasets very limiting. There are a few raster/Spark libraries around (for example GeoTrellis, RasterFrames and Apache Sedona), but they are not very mature.

I have therefore been exploring alternative ways to efficiently work with raster data on the Databricks platform, which leverages Spark / Delta tables / Parquet files.

One idea I had was to dump the raster data to simple x, y, value columns and load them as tables. Provided my other datasets are at the same resolution (I will pre-process them so that they are), I should then be able to run simple SQL queries for masking/addition/subtraction, as well as more complex user-defined functions.
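A minimal sketch of that idea follows. It uses Python's built-in sqlite3 module purely as a stand-in for Spark SQL on Delta tables (an assumption made so the example runs anywhere), and the table names raster_a and raster_b and their cell values are hypothetical. Two rasters on the same grid are stored as x, y, value rows, then joined on the cell coordinates to add them while masking out cells where the second raster is not positive.

```python
import sqlite3

# In-memory SQLite is only a stand-in here for a Spark SQL / Delta table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raster_a (x REAL, y REAL, value REAL)")
conn.execute("CREATE TABLE raster_b (x REAL, y REAL, value REAL)")

# Hypothetical 2 x 2 rasters on the same grid (cell-centre coordinates).
conn.executemany(
    "INSERT INTO raster_a VALUES (?, ?, ?)",
    [(0.5, 0.5, 1.0), (1.5, 0.5, 2.0), (0.5, 1.5, 3.0), (1.5, 1.5, 4.0)],
)
conn.executemany(
    "INSERT INTO raster_b VALUES (?, ?, ?)",
    [(0.5, 0.5, 10.0), (1.5, 0.5, 0.0), (0.5, 1.5, 30.0), (1.5, 1.5, 40.0)],
)

# Add the two rasters cell by cell, keeping only cells where raster_b > 0 (the mask).
result = conn.execute(
    """
    SELECT a.x, a.y, a.value + b.value AS value
    FROM raster_a AS a
    JOIN raster_b AS b ON a.x = b.x AND a.y = b.y
    WHERE b.value > 0
    ORDER BY a.y, a.x
    """
).fetchall()

for row in result:
    print(row)
```

Joining on floating-point coordinates only works when both tables carry exactly identical values; storing integer column and row indices alongside x and y would likely be a more robust join key.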

Step one, I thought, would be to dump my raster to points as a CSV, which I could then load into a Delta table. But after 12 hours running on my Databricks cluster (128 GB memory, 16 cores), a 3 GB raster had still not finished (I was using the gdal2xyz command below).

Does anyone have a quicker way to dump a raster to CSV? Or, even better, directly to Parquet format?

python gdal2xyz.py -band 1 -skipnodata "AR_FLRF_UD_Q1500_RD_02.tif" "AR_FLRF_UD_Q1500_RD_02.csv"
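For reference, the per-cell work behind an x, y, value dump is small. The sketch below is not gdal2xyz's own code; it is a plain-Python illustration, assuming a GDAL-style six-element geotransform, of turning a block of pixel values into cell-centre coordinates while skipping nodata cells. The function name pixels_to_xyz and the sample values are hypothetical.

```python
def pixels_to_xyz(rows, geotransform, nodata=None, row_off=0, col_off=0):
    """Yield (x, y, value) for every valid cell centre in a block of pixel rows.

    geotransform follows the GDAL convention:
    (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height)
    row_off / col_off give the block's position within the full raster.
    """
    x0, dx, rx, y0, ry, dy = geotransform
    for r, row in enumerate(rows, start=row_off):
        for c, value in enumerate(row, start=col_off):
            if nodata is not None and value == nodata:
                continue  # same intent as the -skipnodata flag
            px, py = c + 0.5, r + 0.5  # shift from cell corner to cell centre
            yield (x0 + px * dx + py * rx, y0 + px * ry + py * dy, value)


# Hypothetical 2 x 2 block: 10-unit cells, north-up, one nodata cell.
gt = (100.0, 10.0, 0.0, 200.0, 0.0, -10.0)
block = [[1, -9999], [3, 4]]
points = list(pixels_to_xyz(block, gt, nodata=-9999))
print(points)
```

Doing this cell by cell in interpreted Python would be slow on a 3 GB raster; the same arithmetic applied to whole array blocks at once is the likelier route to acceptable speed.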

Maybe I could tile the raster, dump each tile to its own CSV file using parallel processing, and then bind the CSV files together, but that seems a bit laborious.
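The tiling itself is only a few lines. The sketch below, using only the Python standard library, splits a raster of a given size into windows and hands each window to a worker. The names tile_windows and process_tile, the raster dimensions and the tile size are all hypothetical, and process_tile is a placeholder: a real version would read that window from the raster and write one CSV or Parquet part file.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_windows(width, height, tile_size):
    """Return (col_off, row_off, tile_width, tile_height) windows covering the raster."""
    windows = []
    for row_off in range(0, height, tile_size):
        for col_off in range(0, width, tile_size):
            windows.append((
                col_off,
                row_off,
                min(tile_size, width - col_off),   # edge tiles may be narrower
                min(tile_size, height - row_off),  # edge tiles may be shorter
            ))
    return windows


def process_tile(window):
    # Placeholder for the real work (windowed read, then write one part file).
    # Here it only reports how many cells the window covers.
    col_off, row_off, tile_width, tile_height = window
    return tile_width * tile_height


# Hypothetical raster of 2500 x 1000 cells, cut into tiles of up to 1024 x 1024.
windows = tile_windows(2500, 1000, 1024)
with ThreadPoolExecutor(max_workers=4) as pool:
    cell_counts = list(pool.map(process_tile, windows))

print(len(windows), sum(cell_counts))
```

If each tile is written as its own part file into one folder, Spark can read the whole folder as a single table, so the separate step of binding the files together may not be needed.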

TheRealJimShady

2 Answers

2

You can use Sedona to load GeoTIFFs into a DataFrame and save the DataFrame in Parquet format. See here: https://sedona.apache.org/latest-snapshot/api/sql/Raster-loader/

1

GDAL has had a Parquet driver since version 3.5. So, with at least that version, you should be able to write raster data to Parquet with "terra" like this:

library(terra)
x <- rast(ncol=10, nrow=10, val=1:100)
writeRaster(x, "file.pqt", driver="Parquet")

You can check the GDAL version that "terra" uses with terra::gdal(). The current CRAN release for Windows is not there yet (but almost):

gdal()
#[1] "3.4.3"
Robert Hijmans