Let's start:
I have a dataframe (trainingDataFrame) which comes from some spatial data. Each row of the dataframe has these four columns: point_id (_c0), x_coord (_c1), y_coord (_c2), and point_class (_c3).
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
|1 |0.0|0.0|a |
|2 |0.0|1.0|a |
|3 |1.0|0.0|b |
|4 |3.0|4.0|b |
|5 |8.0|7.0|b |
|6 |4.0|9.0|b |
|7 |2.0|5.0|a |
|8 |1.0|9.0|a |
|9 |3.0|6.0|a |
|10 |8.0|2.0|c |
|11 |9.0|1.0|a |
|12 |2.0|7.0|c |
|13 |2.0|9.0|c |
|14 |2.0|4.0|b |
|15 |1.0|3.0|c |
|16 |4.0|6.0|c |
|17 |3.0|5.0|c |
|18 |5.0|3.0|a |
|19 |5.0|9.0|b |
|20 |8.0|9.0|c |
+---+---+---+---+
I have created a function that takes the x_coord and y_coord of any given point and returns the cell in space that the point belongs to (there are 4 cells).
def icchId(X: Double, Y: Double, F_avgX: Double, F_avgY: Double): Any = {
  if (X < F_avgX && Y < F_avgY) {
    return "ICCH 1"
  }
  else if (X < F_avgX && Y >= F_avgY) {
    return "ICCH 2"
  }
  else if (X >= F_avgX && Y >= F_avgY) {
    return "ICCH 3"
  }
  else if (X > F_avgX && Y < F_avgY) {
    return "ICCH 4"
  }
  else
    return 0
}
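For reference, I suspect the mixed return types (String in four branches, Int in the last) are what force the inferred result to Any. A variant that always returns a String keeps the type concrete; this is just a sketch, and the fallback label "ICCH 0" is a placeholder I chose, not something from my original code:

```scala
// Variant of icchId with an explicit String return type, so callers
// never receive Any. "ICCH 0" is a placeholder for the fallback case.
def icchIdTyped(x: Double, y: Double, fAvgX: Double, fAvgY: Double): String = {
  if (x < fAvgX && y < fAvgY) "ICCH 1"
  else if (x < fAvgX && y >= fAvgY) "ICCH 2"
  else if (x >= fAvgX && y >= fAvgY) "ICCH 3"
  else if (x > fAvgX && y < fAvgY) "ICCH 4"
  else "ICCH 0"
}

// Quick check: point (1.0, 2.0) with averages (3.55, 4.9) falls in cell 1.
println(icchIdTyped(1.0, 2.0, 3.55, 4.9)) // prints "ICCH 1"
```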
My goal is to create an RDD in which each row has this form:
[point_ICCHid] (the icchId function's return value), [x_coord - y_coord] (key), [point_class] (value)
The point_ICCHid will be provided by the icchId function; the x_coord and y_coord come from the dataframe, as does the class of each point.
My attempt is shown below:
val trainingRDD : RDD[Row] = trainingDataFrame.rdd.map(r => (icchId(r(1),r(2),avgX,avgY),(r(1),r(2),r(3))) )
but I get this error:
error: type mismatch; found : Any required: Double val trainingRDD : RDD[Row] = trainingDataFrame.rdd.map(r => (icchId(r(1),r(2),avgX,avgY),(r(1),r(2),r(3))) )
Note that I am using Databricks Community Edition for this project, and that I am trying to pass a custom function into the creation of my RDD.
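To show the shape of what I am after without the Spark machinery: the mismatch seems to come from r(1) returning Any rather than Double. Below is a plain-Scala sketch using a Seq of tuples in place of the real Rows; the average values (3.55, 4.9) are made up for illustration, and in real Spark code the typed values would be extracted with something like r.getDouble(1):

```scala
// Cell-assignment helper with a String return type ("ICCH 0" is a
// placeholder fallback, not from my original code).
def icchId(x: Double, y: Double, fAvgX: Double, fAvgY: Double): String =
  if (x < fAvgX && y < fAvgY) "ICCH 1"
  else if (x < fAvgX && y >= fAvgY) "ICCH 2"
  else if (x >= fAvgX && y >= fAvgY) "ICCH 3"
  else if (x > fAvgX && y < fAvgY) "ICCH 4"
  else "ICCH 0"

// Stand-in for trainingDataFrame.rdd: (id, x, y, class) tuples.
val rows = Seq((1, 0.0, 0.0, "a"), (10, 8.0, 2.0, "c"), (20, 8.0, 9.0, "c"))
val (avgX, avgY) = (3.55, 4.9) // illustrative averages, not computed here

// Map each row to (cellId, ((x, y), class)) -- the target key/value shape.
val keyed = rows.map { case (_, x, y, cls) => (icchId(x, y, avgX, avgY), ((x, y), cls)) }
keyed.foreach(println)
```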
EDIT:
After some tweaks to the creation of the RDD, based on the answer given in the comments to a similar question, I came up with this edit to the line of code:
val trainingRDD : RDD[Row] = trainingDataFrame.rdd.map(r => r.icchId(1,2,avgX,avgY))
Now the error is, I think, worse than before:
error: value icchId is not a member of org.apache.spark.sql.Row
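From what I understand, this error just says that icchId is a standalone function, not a method defined on Row, so presumably it has to be called as icchId(...) with values pulled out of r, rather than as r.icchId(...). A tiny pure-Scala illustration of that distinction (shout is a throwaway name I made up):

```scala
// A standalone function: applied to a value, not called as a method on it.
def shout(s: String): String = s.toUpperCase

val word = "spark"
println(shout(word))    // OK: function applied to a value, prints "SPARK"
// println(word.shout)  // would not compile: shout is not a member of String
```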