Let's start:
I have a dataframe (trainingDataFrame) which comes from some spatial data. Each row of the dataframe has these four columns: point_id (_c0), x_coord (_c1), y_coord (_c2), and point_class (_c3).
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
|1 |0.0|0.0|a |
|2 |0.0|1.0|a |
|3 |1.0|0.0|b |
|4 |3.0|4.0|b |
|5 |8.0|7.0|b |
|6 |4.0|9.0|b |
|7 |2.0|5.0|a |
|8 |1.0|9.0|a |
|9 |3.0|6.0|a |
|10 |8.0|2.0|c |
|11 |9.0|1.0|a |
|12 |2.0|7.0|c |
|13 |2.0|9.0|c |
|14 |2.0|4.0|b |
|15 |1.0|3.0|c |
|16 |4.0|6.0|c |
|17 |3.0|5.0|c |
|18 |5.0|3.0|a |
|19 |5.0|9.0|b |
|20 |8.0|9.0|c |
+---+---+---+---+
I have created a function that takes the x_coord and y_coord of any given point and returns the cell in space that the point belongs to (there are 4 cells).
def icchId(X: Double, Y: Double, F_avgX: Double, F_avgY: Double): Any = {
  if (X < F_avgX && Y < F_avgY) {
    return "ICCH 1"
  }
  else if (X < F_avgX && Y >= F_avgY) {
    return "ICCH 2"
  }
  else if (X >= F_avgX && Y >= F_avgY) {
    return "ICCH 3"
  }
  else if (X > F_avgX && Y < F_avgY) {
    return "ICCH 4"
  }
  else
    return 0
}
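For reference, I suspect the mixed return types (String in four branches, Int in the last) are what force the inferred result to Any. A variant that always returns a String keeps the type concrete; this is just a sketch, and the fallback label "ICCH 0" is a placeholder I chose, not something from my original code:

```scala
// Variant of icchId with an explicit String return type, so callers
// never receive Any. "ICCH 0" is a placeholder for the fallback case.
def icchIdTyped(x: Double, y: Double, fAvgX: Double, fAvgY: Double): String = {
  if (x < fAvgX && y < fAvgY) "ICCH 1"
  else if (x < fAvgX && y >= fAvgY) "ICCH 2"
  else if (x >= fAvgX && y >= fAvgY) "ICCH 3"
  else if (x > fAvgX && y < fAvgY) "ICCH 4"
  else "ICCH 0"
}

// Quick check: point (1.0, 2.0) with averages (3.55, 4.9) falls in cell 1.
println(icchIdTyped(1.0, 2.0, 3.55, 4.9)) // prints "ICCH 1"
```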
My goal is to create an RDD in which each row has this form:
[point_ICCHid] (the icchId function's return value), [x_coord - y_coord] (key), [point_class] (value)
The point_ICCHid will be provided by the icchId function; the x_coord and y_coord come from the dataframe, as does the class of each point.
My attempt is shown below:
val trainingRDD : RDD[Row] = trainingDataFrame.rdd.map(r => (icchId(r(1),r(2),avgX,avgY),(r(1),r(2),r(3))) )
but I get this error:
error: type mismatch; found : Any required: Double val trainingRDD : RDD[Row] = trainingDataFrame.rdd.map(r => (icchId(r(1),r(2),avgX,avgY),(r(1),r(2),r(3))) )
Note that I am using Databricks Community Edition for this project, and that I am trying to pass a custom function into the creation of my RDD.
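To show the shape of what I am after without the Spark machinery: the mismatch seems to come from r(1) returning Any rather than Double. Below is a plain-Scala sketch using a Seq of tuples in place of the real Rows; the average values (3.55, 4.9) are made up for illustration, and in real Spark code the typed values would be extracted with something like r.getDouble(1):

```scala
// Cell-assignment helper with a String return type ("ICCH 0" is a
// placeholder fallback, not from my original code).
def icchId(x: Double, y: Double, fAvgX: Double, fAvgY: Double): String =
  if (x < fAvgX && y < fAvgY) "ICCH 1"
  else if (x < fAvgX && y >= fAvgY) "ICCH 2"
  else if (x >= fAvgX && y >= fAvgY) "ICCH 3"
  else if (x > fAvgX && y < fAvgY) "ICCH 4"
  else "ICCH 0"

// Stand-in for trainingDataFrame.rdd: (id, x, y, class) tuples.
val rows = Seq((1, 0.0, 0.0, "a"), (10, 8.0, 2.0, "c"), (20, 8.0, 9.0, "c"))
val (avgX, avgY) = (3.55, 4.9) // illustrative averages, not computed here

// Map each row to (cellId, ((x, y), class)) -- the target key/value shape.
val keyed = rows.map { case (_, x, y, cls) => (icchId(x, y, avgX, avgY), ((x, y), cls)) }
keyed.foreach(println)
```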
EDIT:
After some tweaks to the creation of the RDD, based on the answer given in the comments to a similar question, I came up with this edit to the line of code:
val trainingRDD : RDD[Row] = trainingDataFrame.rdd.map(r => r.icchId(1,2,avgX,avgY))
Now the error is, I think, worse than before:
error: value icchId is not a member of org.apache.spark.sql.Row
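From what I understand, this error just says that icchId is a standalone function, not a method defined on Row, so presumably it has to be called as icchId(...) with values pulled out of r, rather than as r.icchId(...). A tiny pure-Scala illustration of that distinction (shout is a throwaway name I made up):

```scala
// A standalone function: applied to a value, not called as a method on it.
def shout(s: String): String = s.toUpperCase

val word = "spark"
println(shout(word))    // OK: function applied to a value, prints "SPARK"
// println(word.shout)  // would not compile: shout is not a member of String
```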