I'm trying to unit test the values returned in a String, but when I try to print them the console shows:

MapPartitionsRDD[32]

My code is as follows:

UPDATED:

val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
val dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray()

for (print1 <- src) {
  var n1: String = src.toString()
  var sourceArr: Array[String] = n1.split(",")

  for (print2 <- dest) {
    var n2: String = dest.toString()

    for (i <- 0 until sourceArr.length) {
      if (n1.split(",")(i).equals(n2.split(",")(i))) {

      }
    }
  }
}

I also tried println(n1.mkString()).

I'm trying to compare the src and dest RDDs to find the differences between their rows.

Phantômaxx
Vickyster
  • Possible duplicate of [How to print the contents of RDD?](http://stackoverflow.com/questions/23173488/how-to-print-the-contents-of-rdd) – OneCricketeer May 08 '17 at 11:57

3 Answers

If you want to see each record in the RDD printed as a separate line, you can use:

src.foreach(println)

This runs println on each record inside the executor that holds it. An RDD's records may be spread across several executors, so the output goes to each executor's stdout rather than to the driver console. If this runs in a test using Spark's "local" mode, there is only one "executor" and it is the same process as the driver, so that's not a problem.

Alternatively, if you do have more than one executor (non-local mode) and you want to make sure the RDD's elements are printed to the driver console, you can first collect the RDD's elements into a local collection and then print them:

src.collect().foreach(println)

NOTE that this assumes the RDD is small enough to be collected into a single machine's memory.

Calling toString on an RDD does not access the RDD's data (which might be too large to fit as a String in the driver machine's memory). As you observed, it just prints the type of the RDD and its ID.
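To make the difference concrete, here is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable for a test. The sample rows and the app name are hypothetical stand-ins for the data in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PrintRddExample {
  def main(args: Array[String]): Unit = {
    // Local mode: the driver and the single "executor" share one JVM.
    val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("print-rdd-example") // hypothetical app name
    val sc = new SparkContext(conf)

    // Hypothetical sample rows standing in for exact_bestmatch_src.
    val src = sc.parallelize(Seq("1,alice,30", "2,bob,25"))
      .filter(line => line.split(",")(0).toInt == 1)

    // toString only describes the RDD (its type and ID); no job is run.
    println(src.toString)

    // collect() runs the job and returns the rows to the driver as an Array[String].
    val rows: Array[String] = src.collect()
    rows.foreach(println)

    sc.stop()
  }
}
```

In a unit test, it is usually easier to assert on the collected `rows` than on anything printed to the console.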

Tzach Zohar

You don't have a list or array. You'd need to collect() an RDD in order to get one, or you need to iterate it via foreach.

Calling println on any object already calls its toString method, by the way. Also, RDD doesn't have a mkString method.

OneCricketeer

Calling toString on src just means you are getting a string representation, which can be anything. For an RDD, this is not the content of the RDD (as that would require getting all the content of the RDD to the driver and printing it at once).

As others have mentioned, in order to print the content of the RDD you first need to get all the data to the driver.

Let's consider the simple solution already proposed:

src.collect().foreach(println)

The first part, collect, tells Spark to get all the content of the RDD and bring it to the driver as a sequence of records. The foreach tells Scala to go over each record in the sequence and pass it as an argument to the println function, which prints it. You could of course use mkString instead of foreach to get a single string.
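As a rough illustration of the mkString variant, the sketch below uses a plain array as a hypothetical stand-in for the result of `src.collect()`, so it runs without Spark. The `render` helper and the sample rows are assumptions made for the example.

```scala
object MkStringExample {
  // Joins already-collected rows into one string, one row per line.
  def render(rows: Array[String]): String = rows.mkString("\n")

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the Array[String] returned by src.collect().
    val rows = Array("1,alice,30", "1,alice,31")
    println(render(rows))
  }
}
```

Note that mkString is called on the collected array, not on the RDD itself.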

Assaf Mendelson
  • Thanks for the comment. To compare the source and destination RDDs, should I convert each to a string as below, if (n1.split(",")(i).equals(n2.split(",")(i))) { }, or is there a way to compare the cells of source and destination by index, i.e. the 0th column of source compared with the 0th column of destination, and likewise for the remaining columns? – Vickyster May 08 '17 at 12:19
  • This depends on your exact goal. If your end data (after all operations) is small, you can simply use collect to bring it to the driver and use standard Scala collection operations. If the data is large, you should consider doing a join between the two and filtering out everything except the differences. Then you can look at only the first few differences. – Assaf Mendelson May 08 '17 at 12:23
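
For the small-data case, a column-by-column comparison with standard Scala collections might look like the sketch below. It assumes the rows have already been brought to the driver (for example with collect) and are comma-separated; the `diffColumns` helper and the sample rows are hypothetical, and the join-based approach for large data is not shown.

```scala
object CompareRows {
  // Returns (columnIndex, sourceValue, destValue) for every column that differs.
  def diffColumns(srcLine: String, destLine: String): Seq[(Int, String, String)] = {
    // A limit of -1 keeps trailing empty fields instead of dropping them.
    val s = srcLine.split(",", -1)
    val d = destLine.split(",", -1)
    // zipAll pads the shorter row with "" so extra columns count as differences.
    s.zipAll(d, "", "").zipWithIndex.collect {
      case ((a, b), i) if a != b => (i, a, b)
    }.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-ins for src.collect() and dest.collect().
    val srcRows = Array("1,alice,30,NY")
    val destRows = Array("1,alice,31,NY")

    for (s <- srcRows; d <- destRows) {
      diffColumns(s, d).foreach { case (i, a, b) =>
        println(s"column $i differs: '$a' vs '$b'") // prints: column 2 differs: '30' vs '31'
      }
    }
  }
}
```

Returning the differing columns as data, rather than printing inside the loop, keeps the comparison easy to assert on in a unit test.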