
I am comparing two DataFrames in Scala/Spark using a nested loop and an external JAR.

for (nrow <- dfm.rdd.collect) {   
  var mid = nrow.mkString(",").split(",")(0)
  var mfname = nrow.mkString(",").split(",")(1)
  var mlname = nrow.mkString(",").split(",")(2)  
  var mlssn = nrow.mkString(",").split(",")(3)  

  for (drow <- dfn.rdd.collect) {
    var nid = drow.mkString(",").split(",")(0)
    var nfname = drow.mkString(",").split(",")(1)
    var nlname = drow.mkString(",").split(",")(2)  
    var nlssn = drow.mkString(",").split(",")(3)  

    val fNameArray = Array(mfname, nfname)
    val lNameArray = Array(mlname, nlname)
    val ssnArray = Array(mlssn, nlssn)

    val fnamescore = Main.resultSet(fNameArray)
    val lnamescore = Main.resultSet(lNameArray)
    val ssnscore = Main.resultSet(ssnArray)

    val overallscore = (fnamescore + lnamescore + ssnscore) / 3

    if(overallscore >= .95) {
       println("MeditechID:".concat(mid)
         .concat(" MeditechFname:").concat(mfname)
         .concat(" MeditechLname:").concat(mlname)
         .concat(" MeditechSSN:").concat(mlssn)
         .concat(" NextGenID:").concat(nid)
         .concat(" NextGenFname:").concat(nfname)
         .concat(" NextGenLname:").concat(nlname)
         .concat(" NextGenSSN:").concat(nlssn)
         .concat(" FnameScore:").concat(fnamescore.toString)
         .concat(" LNameScore:").concat(lnamescore.toString)
         .concat(" SSNScore:").concat(ssnscore.toString)
         .concat(" OverallScore:").concat(overallscore.toString))
    }
  }
}

What I'm hoping to do is add some parallelism to the outer loop: create a thread pool of, say, 5 threads and pull 5 records from the outer loop's collection to compare against the inner loop's collection, rather than doing this serially. The outcome would be that I can specify the number of threads and have 5 records from the outer loop's collection processing at any given time against the collection in the inner loop. How would I go about doing this?
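For illustration, something like this driver-side thread pool is roughly what I have in mind (a rough sketch using Scala parallel collections; the pool size of 5 and the names are mine, and on Scala 2.11 the ForkJoinPool would come from scala.concurrent.forkjoin rather than java.util.concurrent):

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val outerRows = dfm.rdd.collect
val innerRows = dfn.rdd.collect   // collect the inner side once instead of once per outer row

val parOuter = outerRows.par
parOuter.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(5))   // 5 outer rows in flight at a time

parOuter.foreach { nrow =>
  innerRows.foreach { drow =>
    // same field extraction, scoring and println as in the nested loop above
  }
}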


2 Answers


Let's start by analyzing what you are doing. You collect the data of dfm to the driver. Then, for each element, you collect the data of dfn, transform it and compute a score for each pair of elements.

That's problematic in many ways. First, even without considering parallel computing, the transformations on the elements of dfn are performed as many times as dfm has elements. Also, you collect the data of dfn for every row of dfm. That's a lot of network communication (between the driver and the executors).

If you want to use Spark to parallelize your computations, you need to use the API (RDDs, SQL or Datasets). You seem to want to use RDDs to perform a cartesian product (this is O(N*M), so be careful, it may take a while; two inputs of 100,000 rows each already produce 10 billion pairs).

Let's start by transforming the data before the cartesian product to avoid performing the transformations more than once per element. Also, for clarity, let's define a case class to hold your data and a function that transforms your dataframes into RDDs of that case class.

import org.apache.spark.sql.DataFrame

case class X(id: String, fname: String, lname: String, lssn: String)

def toRDDofX(df: DataFrame) = {
    df.rdd.map(row => {
        // using pattern matching to convert the array to the case class X
        row.mkString(",").split(",") match {
            case Array(a, b, c, d) => X(a, b, c, d)
        }
    })
}
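Note that mkString(",").split(",") will misbehave if any of the fields themselves contain a comma. If the first four columns are plain strings, a variant that reads the fields by position avoids that (a sketch under that assumption; the name toRDDofXByPosition is just for illustration):

// alternative: read the first four columns by position instead of re-splitting a joined string
// (assumes they are StringType and appear in the order id, fname, lname, lssn)
def toRDDofXByPosition(df: DataFrame) = {
    df.rdd.map(row => X(row.getString(0), row.getString(1), row.getString(2), row.getString(3)))
}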

Then, I use filter to keep only the tuples whose score is more than .95, but you could use map, foreach, ... depending on what you intend to do.

val rddn = toRDDofX(dfn)
val rddm = toRDDofX(dfm)
rddn.cartesian(rddm).filter{ case (xn, xm) => {
    val fNameArray = Array(xm.fname, xn.fname)
    val lNameArray = Array(xm.lname, xn.lname)
    val ssnArray = Array(xm.lssn, xn.lssn)

    val fnamescore = Main.resultSet(fNameArray)
    val lnamescore = Main.resultSet(lNameArray)
    val ssnscore = Main.resultSet(ssnArray)

    val overallscore = (fnamescore + lnamescore + ssnscore) / 3
    // and then, let's say we filter by score
    overallscore > .95
}}
  • Thank you very much for this detailed explanation. Your code is working, and by assigning this to a val rdd and then calling rdd.take(100).foreach(println), it displays the filtered records as: X(val1, val2, .....). I'm now trying to figure out how to write this RDD back to a dataframe so I can then write it back to the database. It keeps erroring out, but I'll keep playing with it. Thank you again! – jymbo May 29 '19 at 05:40
  • And I figured out how to write the results back to a dataframe. Thanks! – jymbo May 29 '19 at 06:05
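For reference, one way to get from the filtered RDD of (X, X) pairs back to a dataframe, as discussed in the comments above, is to flatten each pair into a tuple and call toDF. This is only a sketch; it assumes a SparkSession named spark is in scope, and the column names are arbitrary:

import spark.implicits._   // spark is your SparkSession

// the filtered RDD of (X, X) pairs from the answer above
val matches = rddn.cartesian(rddm).filter { case (xn, xm) => /* scoring and threshold as above */ true }

// flatten each pair into a tuple so toDF can derive a schema, then name the columns
val matchesDF = matches
  .map { case (xn, xm) => (xm.id, xm.fname, xm.lname, xm.lssn, xn.id, xn.fname, xn.lname, xn.lssn) }
  .toDF("mid", "mfname", "mlname", "mlssn", "nid", "nfname", "nlname", "nlssn")

matchesDF.show()
// matchesDF.write.jdbc(...) or similar can then persist it to the database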

This is not the right way of iterating over a Spark dataframe. The major concern is the dfm.rdd.collect. If the dataframe is arbitrarily large, you would end up with an exception. This is due to the fact that the collect function essentially brings all the data to the driver node.

An alternative way would be to use the foreach or map construct of the RDD.

dfm.rdd.foreach(x => {
    // your logic
})

Now you are trying to iterate over the second dataframe here. I am afraid that won't be possible. The elegant way is to join dfm and dfn and iterate over the resulting dataset to compute your function.
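Since, as the comments below point out, there is no common key to join on, a cross join would be the Spark equivalent of the nested loop. A rough sketch of that approach (the column names and the assumption that Main.resultSet returns a Double are mine; adjust to your schemas):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// assumes both dataframes have four string columns in the order id, fname, lname, ssn
val m = dfm.toDF("mid", "mfname", "mlname", "mlssn")
val n = dfn.toDF("nid", "nfname", "nlname", "nlssn")

val scored = m.crossJoin(n).map { row =>
  val fnamescore = Main.resultSet(Array(row.getAs[String]("mfname"), row.getAs[String]("nfname")))
  val lnamescore = Main.resultSet(Array(row.getAs[String]("mlname"), row.getAs[String]("nlname")))
  val ssnscore   = Main.resultSet(Array(row.getAs[String]("mlssn"), row.getAs[String]("nlssn")))
  (row.getAs[String]("mid"), row.getAs[String]("nid"), (fnamescore + lnamescore + ssnscore) / 3)
}.toDF("mid", "nid", "overallscore")

scored.filter($"overallscore" >= 0.95).show()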

  • New to Scala/Spark, so I don't know all of the esoteric ins and outs. This is why I'm asking the question here... so I don't understand the need for the down-vote. How would you suggest I implement the dataframe join code to accomplish this? – jymbo May 27 '19 at 08:13
  • Firstly, sorry for the downvote, I didn't do that. You might need to dive a little deeper into Spark before proceeding with the problem. The way I would proceed is to join dfm and dfn on their respective unique IDs, then iterate over the resulting dataframe to produce another dataframe with the required fields. https://stackoverflow.com/questions/49252670/iterate-rows-and-columns-in-spark-dataframe might help. Another link: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html – Avishek Bhattacharya May 27 '19 at 08:18
  • I'll look into your suggested links. Unfortunately I can't join on an ID; I have to perform a cartesian on both dataframes to compare every record from the outer loop with every record in the inner loop. – jymbo May 27 '19 at 08:22
  • If your datasets are small (GBs), you could try to do a Cartesian. You see, the problem is that you can't do nested looping over dataframes in Spark. The way to do it is a join and foreach/map. – Avishek Bhattacharya May 27 '19 at 08:24