
I have a Dataset of geospatial data that I need to sample in a grid-like fashion. I want to divide the experiment area into a grid and, on each square of the grid, apply a sampling function called "sample()" that takes three inputs, then merge the sampled datasets back together. My current method uses a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in map-reduce fashion:

val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
    val latMin = row._1
    val latMax = latMin + 0.0001
    val lonMin = row._2
    val lonMax = lonMin + 0.0001
    val filterDF = featuresDS
      .filter($"Latitude" > latMin).filter($"Latitude" < latMax)
      .filter($"Longitude" > lonMin).filter($"Longitude" < lonMax)
    val sampleDS = filterDF.sample(false, 0.05, 1234)
    sampleDS
})


val output = sampleDataRDD.reduce(_ union _)

I've tried various ways of dealing with this, such as converting sampleDS to an RDD or to a List, but I still get a NullPointerException when calling "collect" on output.

I'm thinking I need to find a different solution, but I don't see it.
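One way to avoid nested Datasets entirely is to turn the problem inside out: instead of mapping over the boundary rows and filtering the features for each one, derive a grid-cell key for every feature row and sample within each cell in a single pass. The sketch below is plain Scala (not the asker's code) illustrating the idea on an in-memory collection; the cell key is just the coordinate floored to the grid resolution from the question (0.0001 degrees). The Spark translation noted in the trailing comment uses hypothetical column names and is an assumption, not tested against the original code.

```scala
// Map a coordinate pair to a grid-cell key at the given resolution
// (0.0001 degrees, matching the grid step in the question).
def cellKey(lat: Double, lon: Double, res: Double = 0.0001): (Long, Long) =
  (math.floor(lat / res).toLong, math.floor(lon / res).toLong)

// Single-pass per-cell sampling on a plain collection: group the
// points by cell, then sample each group independently.
def samplePerCell(points: Seq[(Double, Double)],
                  fraction: Double,
                  seed: Long = 1234L): Seq[(Double, Double)] = {
  val rng = new scala.util.Random(seed)
  points.groupBy { case (lat, lon) => cellKey(lat, lon) }
        .values
        .flatMap(cell => cell.filter(_ => rng.nextDouble() < fraction))
        .toSeq
}

// In Spark the same shape would be (hypothetical column names):
//   val keyed = featuresDS
//     .withColumn("cellLat", floor($"Latitude" / 0.0001))
//     .withColumn("cellLon", floor($"Longitude" / 0.0001))
// followed by a per-key sample, e.g. filtering on rand(seed) < fraction,
// or df.stat.sampleBy on a single combined key column.
```

Because the key is computed per row, no nested Dataset is ever created, and the union step disappears: the result is already one collection.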

I've referenced these questions thus far:

Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset

Creating a Spark DataFrame from an RDD of lists

user306603
  • There are many ways to do this, but there's not enough info in the question to determine which one is the best one. One way would be to use `partitionBy` to divide the dataset into grids, but that may not be viable, or may be skewed; it depends on the data distribution. You could also use lists, but only if each grid has a relatively small number of elements. It also depends on what you're trying to do: are you just trying to split the data into squares? If so, `partitionBy` is the way to go. If not, and you also have to calculate some aggregate, you could write a UDAF – Roberto Congiu Mar 12 '18 at 21:35
  • How large is `boundaryValuesDS`? Would it be feasible to collect it to the driver? – Shaido Mar 13 '18 at 02:20
  • There are about 350,000 rows in the dataset (with 4 columns), so I think it's too big for Lists, and I've tried collecting on the driver, which generally crashes. @RobertoCongiu is it possible to `partitionBy` a certain value in the dataset? (i.e. latitude and longitude coordinates) – user306603 Mar 13 '18 at 15:14
  • It is possible, but again, it depends on the specifics of your data. If the grid is rectangular with non-overlapping tiles and no gaps, all you need is a function `f(x,y)` that produces a number between 0 and the number of tiles. For instance, if your latitudes are between 20 and 40 and longitudes between 50 and 110, you could use something like `100*(lat -20)*100/(40-20) + (long-50)*100/(110-50)`; this will give you 10,000 tiles, from 0 to 9999, with the first two digits giving you the row number and the last two the column number. – Roberto Congiu Mar 13 '18 at 21:03
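The tile-index function described in the last comment can be sketched in plain Scala as follows. This is a generalization of the commenter's formula, not code from the question: the bounding box and grid size `n` are parameters, and points exactly on the max edge are clamped into the last tile (an assumption the comment does not spell out).

```scala
// Map a coordinate inside a known bounding box onto an n x n grid and
// return a single tile number (row * n + col), so for n = 100 the
// tiles run from 0 to 9999 as in the comment above.
def tileIndex(lat: Double, lon: Double,
              latMin: Double, latMax: Double,
              lonMin: Double, lonMax: Double,
              n: Int): Int = {
  // Clamp so points exactly on the max edge fall in the last tile.
  def cell(v: Double, lo: Double, hi: Double): Int =
    math.min(((v - lo) / (hi - lo) * n).toInt, n - 1)
  cell(lat, latMin, latMax) * n + cell(lon, lonMin, lonMax)
}
```

With the comment's example box (latitudes 20 to 40, longitudes 50 to 110, n = 100), the midpoint (30.0, 80.0) lands in tile 5050: row 50, column 50. A function like this could serve as the grid key for `partitionBy` or a `groupBy`.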

0 Answers