I'm trying to use Spark (PySpark) to run analysis on data stored in multi-band GeoTIFFs. I'm still somewhat of a Spark newbie.
Setup:
The GeoTIFFs themselves are small enough to process in pure Python - specifically, I am using GDAL to read the data. I then create DataFrames and do the analysis.
However, the analysis takes a while. And, on a recurring basis, I will have hundreds of GeoTIFFs to analyze - enter PySpark.
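For context, the single-machine version looks roughly like this (a simplified sketch, not my exact code; the path and the per-band DataFrame layout are just illustrations):

```python
from osgeo import gdal
import numpy as np
import pandas as pd

def tiff_to_dataframe(path):
    # Read every band of the GeoTIFF and flatten each one into a DataFrame column
    ds = gdal.Open(path, gdal.GA_ReadOnly)
    bands = [ds.GetRasterBand(i + 1).ReadAsArray().ravel()
             for i in range(ds.RasterCount)]
    cols = ["band_{}".format(i + 1) for i in range(ds.RasterCount)]
    return pd.DataFrame(np.column_stack(bands), columns=cols)

df = tiff_to_dataframe("/data/tiffs/example.tif")  # local path, placeholder
# ... analysis on df ...
```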
Problem:
I've written code that runs the analysis on a local pseudo-cluster. However, it fails on a proper cluster because the data stored on the master node cannot be read locally by the worker nodes.
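Roughly what that pseudo-cluster code does (again a sketch; `tiff_to_dataframe` is the GDAL helper sketched above, and the per-band mean stands in for the real analysis):

```python
from pyspark import SparkContext

sc = SparkContext(appName="geotiff-analysis")

# These paths exist only on the master's local disk - exactly why this
# works on the local pseudo-cluster but fails on a real one: the worker
# nodes have no way to read these files.
paths = ["/data/tiffs/a.tif", "/data/tiffs/b.tif"]  # placeholder paths

results = (sc.parallelize(paths)
             .map(lambda p: tiff_to_dataframe(p).mean().to_dict())
             .collect())
```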
HDFS should come to the rescue; however, `sc.textFile(...)` returns the raw, unprocessed content of the GeoTIFF, which isn't very useful.
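What I tried, roughly (the path is a placeholder):

```python
raw = sc.textFile("hdfs:///data/tiffs/example.tif")
raw.first()  # binary TIFF bytes decoded as text lines, not pixel values
```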
I could preprocess the data to turn the GeoTIFFs into CSVs, but the additional overhead might make it not worthwhile.
The two solutions I had hoped to find were:
- A Spark method, let's call it `sc.rasterFile(...)`, which would read a GeoTIFF into a Spark DataFrame or RDD.
- The ability to access HDFS from pure Python, something like `gdal.Open("hdfs://...", gc.GA_ReadOnly)`.
Questions:
- Am I correct to assume that neither of the above solutions is possible?
- Are there other tools/methods/APIs for working with TIFFs in Spark?
Thanks!