
Earlier I posted a self-join problem in Scala. I am now trying to implement the same thing in Spark but am not able to convert the code. Here is the problem and my code. Input data set:

Property_ID, latitude, longitude, Address
123, 33.84, -118.39, null
234, 35.89, -119.48, null
345, 35.34, -119.39, null

Output data set

Property_ID1, Property_ID2, distance
123,123,0
123,234,0.1
123,345,0.6
234,234,0
234,123,0.1
234,345,0.7
345,345,0
345,123,0.6
345,234,0.7

Spark Code:

`import scala.math._

object Haversine {
  val R = 6372.8  // Earth's radius in km

  def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val dLat = (lat2 - lat1).toRadians
    val dLon = (lon2 - lon1).toRadians

    val a = pow(sin(dLat / 2), 2) + pow(sin(dLon / 2), 2) * cos(lat1.toRadians) * cos(lat2.toRadians)
    val c = 2 * asin(sqrt(a))
    R * c
  }

  def main(args: Array[String]): Unit = {
    println(haversine(36.12, -86.67, 33.94, -118.40))
  }
}

class SimpleCSVHeader(header: Array[String]) extends Serializable {
  val index = header.zipWithIndex.toMap
  def apply(array: Array[String], key: String): String = array(index(key))
}

val csv = sc.textFile("geo.csv")  // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim))  // split each line into columns
val header = new SimpleCSVHeader(data.take(1)(0))  // build the header from the first line
val rows = data.filter(line => header(line, "latitude") != "latitude")  // filter the header row out

// Each row has four columns (Address is null in the sample data),
// so match the first three and ignore the rest.
val typed = rows.map { case Array(id, lat, lon, _*) => (id, lat.toDouble, lon.toDouble) }
`

After this I need to do a self join on `typed` and pass the pairs through the Haversine method. I got the Scala code below from the community, and I need to convert it to Spark code that works with RDDs. The code below currently works for Lists:

`val combos = for {
    a <- typed
    b <- typed
  } yield (a,b)

combos.map{ case ((id1, lat1, lon1), (id2, lat2, lon2)) 
     => id1 + "," + id2 + "," + haversine(lat1, lon1, lat2, lon2)} foreach println`

Can anybody help? Thanks in advance.

Amit

1 Answer


The Spark operation you want is `cartesian`. You can learn more at Spark: produce RDD[(X, X)] of all possible combinations from RDD[X].
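For reference, here is a minimal sketch of what that could look like with the `typed` RDD and the `Haversine` object from your question (this assumes `typed` is an `RDD[(String, Double, Double)]` and that `Haversine` is on the executors' classpath):

`val combos = typed.cartesian(typed)  // all (a, b) pairs, including self-pairs like (123, 123)

val distances = combos.map { case ((id1, lat1, lon1), (id2, lat2, lon2)) =>
  id1 + "," + id2 + "," + Haversine.haversine(lat1, lon1, lat2, lon2)
}

distances.collect().foreach(println)  // bring results to the driver before printing
`

Note that `cartesian` produces n² pairs, so it gets expensive for large data sets; it also emits both (a, b) and (b, a), which matches your expected output.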

Joe Pallas