I'm trying to use Spark (PySpark) to run analysis on data stored in multi-band GeoTIFFs. I'm still somewhat of a Spark newbie.
Setup:
The GeoTIFFs themselves are small enough to process in pure Python - specifically, I am using GDAL to read the data. I then create DataFrames and do the analysis.
However, the analysis takes a while. And, on a recurring basis, I will have hundreds of GeoTIFFs to analyze - enter PySpark.
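For context, the single-machine version looks roughly like this (a simplified sketch, not my exact code; the path and the per-band DataFrame layout are just illustrations):

```python
from osgeo import gdal
import numpy as np
import pandas as pd

def tiff_to_dataframe(path):
    # Read every band of the GeoTIFF and flatten each one into a DataFrame column
    ds = gdal.Open(path, gdal.GA_ReadOnly)
    bands = [ds.GetRasterBand(i + 1).ReadAsArray().ravel()
             for i in range(ds.RasterCount)]
    cols = ["band_{}".format(i + 1) for i in range(ds.RasterCount)]
    return pd.DataFrame(np.column_stack(bands), columns=cols)

df = tiff_to_dataframe("/data/tiffs/example.tif")  # local path, placeholder
# ... analysis on df ...
```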
Problem:
I've written code that runs the analysis on a local pseudo-cluster. However, it fails on a proper cluster because the data stored on the master node cannot be read locally by the worker nodes.
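Roughly what that pseudo-cluster code does (again a sketch; `tiff_to_dataframe` is the GDAL helper sketched above, and the per-band mean stands in for the real analysis):

```python
from pyspark import SparkContext

sc = SparkContext(appName="geotiff-analysis")

# These paths exist only on the master's local disk - exactly why this
# works on the local pseudo-cluster but fails on a real one: the worker
# nodes have no way to read these files.
paths = ["/data/tiffs/a.tif", "/data/tiffs/b.tif"]  # placeholder paths

results = (sc.parallelize(paths)
             .map(lambda p: tiff_to_dataframe(p).mean().to_dict())
             .collect())
```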
HDFS should come to the rescue; however, `sc.textFile(...)` returns the raw, unprocessed content of the GeoTIFF, which isn't very useful.
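What I tried, roughly (the path is a placeholder):

```python
raw = sc.textFile("hdfs:///data/tiffs/example.tif")
raw.first()  # binary TIFF bytes decoded as text lines, not pixel values
```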
I could preprocess the data to turn the GeoTIFFs into CSVs, but the additional overhead might make it not worthwhile.
The two solutions I had hoped to find were:
- A Spark method, let's call it `sc.rasterFile(...)`, which would read a GeoTIFF into a Spark DataFrame or RDD.
- The ability to access HDFS from pure Python, something like `gdal.Open("hdfs://...", gc.GA_ReadOnly)`.
Questions:
- Am I correct to assume that neither of the above solutions is possible?
- Are there other tools/methods/APIs for working with TIFFs in Spark?
Thanks!