
I have a Dataset of geospatial data that I need to sample in a grid-like fashion. I want to divide the experiment area into a grid and, on each square of the grid, apply a sampling function called "sample()" that takes three inputs, then merge the sampled datasets back together. My current method uses a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in map-reduce fashion:

val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
    val latMin = row._1
    val latMax = latMin + 0.0001
    val lonMin = row._2
    val lonMax = lonMin + 0.0001
    val filterDF = featuresDS
      .filter($"Latitude" > latMin).filter($"Latitude" < latMax)
      .filter($"Longitude" > lonMin).filter($"Longitude" < lonMax)
    val sampleDS = filterDF.sample(false, 0.05, 1234)
    sampleDS
})


val output = sampleDataRDD.reduce(_ union _)

I've tried various ways of dealing with this, such as converting sampleDS to an RDD or to a List, but I still get a NullPointerException when calling "collect" on output.

I'm thinking I need to find a different solution, but I don't see it.
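One way to avoid nested Datasets entirely is to turn the problem inside out: instead of mapping over the boundary rows and filtering the features for each one, derive a grid-cell key for every feature row and sample within each cell in a single pass. The sketch below is plain Scala (not the asker's code) illustrating the idea on an in-memory collection; the cell key is just the coordinate floored to the grid resolution from the question (0.0001 degrees). The Spark translation noted in the trailing comment uses hypothetical column names and is an assumption, not tested against the original code.

```scala
// Map a coordinate pair to a grid-cell key at the given resolution
// (0.0001 degrees, matching the grid step in the question).
def cellKey(lat: Double, lon: Double, res: Double = 0.0001): (Long, Long) =
  (math.floor(lat / res).toLong, math.floor(lon / res).toLong)

// Single-pass per-cell sampling on a plain collection: group the
// points by cell, then sample each group independently.
def samplePerCell(points: Seq[(Double, Double)],
                  fraction: Double,
                  seed: Long = 1234L): Seq[(Double, Double)] = {
  val rng = new scala.util.Random(seed)
  points.groupBy { case (lat, lon) => cellKey(lat, lon) }
        .values
        .flatMap(cell => cell.filter(_ => rng.nextDouble() < fraction))
        .toSeq
}

// In Spark the same shape would be (hypothetical column names):
//   val keyed = featuresDS
//     .withColumn("cellLat", floor($"Latitude" / 0.0001))
//     .withColumn("cellLon", floor($"Longitude" / 0.0001))
// followed by a per-key sample, e.g. filtering on rand(seed) < fraction,
// or df.stat.sampleBy on a single combined key column.
```

Because the key is computed per row, no nested Dataset is ever created, and the union step disappears: the result is already one collection.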

I've referenced these questions thus far:

Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset

Creating a Spark DataFrame from an RDD of lists

user306603
  • There are many ways to do this, but there's not enough info in the question to determine which one is the best one. One way would be to use `partitionBy` to divide the dataset into grids, but that may not be viable, or may be skewed; it depends on the data distribution. You could also use lists, but only if each grid has a relatively small number of elements. It also depends on what you're trying to do: are you just trying to split the data into squares? If so, `partitionBy` is the way to go. If not, and you also have to calculate some aggregate, you could write a UDAF – Roberto Congiu Mar 12 '18 at 21:35
  • How large is `boundaryValuesDS`? Would it be feasible to collect it to the driver? – Shaido Mar 13 '18 at 02:20
  • There are about 350,000 rows in the dataset (with 4 columns), so I think it's too big for Lists, and I've tried collecting on the driver, which generally crashes. @RobertoCongiu is it possible to `partitionBy` a certain value in the dataset? (i.e. latitude and longitude coordinates) – user306603 Mar 13 '18 at 15:14
  • It is possible, but again, it depends on the specifics of your data. If the grid is rectangular with non-overlapping tiles and no gaps, all you need is a function `f(x,y)` that produces a number between 0 and the number of tiles. For instance, if your latitudes are between 20 and 40 and longitudes between 50 and 110, you could use something like `100*(lat -20)*100/(40-20) + (long-50)*100/(110-50)`; this will give you 10,000 tiles, from 0 to 9999, with the first two digits giving you the row number and the last two the column number. – Roberto Congiu Mar 13 '18 at 21:03
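The tile-index function described in the last comment can be sketched in plain Scala as follows. This is a generalization of the commenter's formula, not code from the question: the bounding box and grid size `n` are parameters, and points exactly on the max edge are clamped into the last tile (an assumption the comment does not spell out).

```scala
// Map a coordinate inside a known bounding box onto an n x n grid and
// return a single tile number (row * n + col), so for n = 100 the
// tiles run from 0 to 9999 as in the comment above.
def tileIndex(lat: Double, lon: Double,
              latMin: Double, latMax: Double,
              lonMin: Double, lonMax: Double,
              n: Int): Int = {
  // Clamp so points exactly on the max edge fall in the last tile.
  def cell(v: Double, lo: Double, hi: Double): Int =
    math.min(((v - lo) / (hi - lo) * n).toInt, n - 1)
  cell(lat, latMin, latMax) * n + cell(lon, lonMin, lonMax)
}
```

With the comment's example box (latitudes 20 to 40, longitudes 50 to 110, n = 100), the midpoint (30.0, 80.0) lands in tile 5050: row 50, column 50. A function like this could serve as the grid key for `partitionBy` or a `groupBy`.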

0 Answers