137

I'm attempting to print the contents of a collection to the Spark console.

I have a type:

linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3]

And I use the command:

scala> linesWithSessionId.map(line => println(line))

But this is what gets printed:

res1: org.apache.spark.rdd.RDD[Unit] = MappedRDD[4] at map at :19

How can I write the RDD to console or save it to disk so I can view its contents?

– blue-sky

10 Answers

280

If you want to view the contents of an RDD, one way is to use collect():

myRDD.collect().foreach(println)

That's not a good idea, though, when the RDD has billions of lines. Use take() to take just a few to print out:

myRDD.take(n).foreach(println)
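For example, a minimal spark-shell session showing both (the sample RDD here is just an illustrative stand-in for your data):

// build a small sample RDD (stand-in for your real data)
val myRDD = sc.parallelize(Seq("alpha", "beta", "gamma", "delta"))

// fine for small RDDs: pulls everything to the driver
myRDD.collect().foreach(println)

// safer on large RDDs: only the first n elements reach the driver
myRDD.take(2).foreach(println)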
– Oussama
  • If I use foreach on an RDD (which has millions of lines) to write the contents into HDFS as a single file, will it work without any issues on a cluster? – Shankar Jul 20 '15 at 11:29
  • The reason I am not using `saveAsTextFile` on the RDD is that I need to write the RDD contents into more than one file; that's why I am using `foreach`. – Shankar Jul 20 '15 at 11:34
  • If you want to save to a single file, you can coalesce your RDD into one partition before calling saveAsTextFile, but again this may cause issues. I think the best option is to write multiple files in HDFS and then use hdfs dfs -getmerge to merge them (see the sketch after these comments). – Oussama Jul 21 '15 at 16:10
  • You said that when you use foreach on an RDD it will persist it into the driver's RAM; is that statement correct? Because what I understood is that foreach will run on each worker in the cluster, not on the driver. – Shankar Jul 22 '15 at 06:23
  • saveAsTextFile will write one file per partition, which is what you want (multiple files). Otherwise, as Oussama suggests, you can do rdd.coalesce(1).saveAsTextFile() to get one file. If the RDD has too few partitions for your liking, you can try rdd.repartition(N).saveAsTextFile(). – foghorn Jan 14 '16 at 19:57
  • @Oussama One question: how can we print an `RDD` of type `Array[String]`? When I try the approach above, I just get memory addresses. Is there a `toString()` functionality? – Brian Mar 03 '16 at 02:08
  • @Brian, you can use any function instead of println; you won't have any serialization problems since you use take(n). So you can create your own printing function and use it in place of println. – Oussama Mar 08 '16 at 07:46
  • `take` doesn't seem to work for me. I think it's sorting the results, which is not what I want (for a start, I have a `scala.Tuple2` which does not implement `Comparable`). Is there a way to just take some results as fast as possible without caring which? – mjaggard Aug 16 '16 at 07:33
  • You could also add a link to the official docs: http://spark.apache.org/docs/latest/programming-guide.html#printing-elements-of-an-rdd – dk14 Oct 22 '16 at 06:01
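A sketch of the single-file and multi-file approaches discussed in these comments, assuming rdd is your RDD[String]; the HDFS paths are hypothetical:

// single part file: collapse to one partition first (all data flows through one task)
rdd.coalesce(1).saveAsTextFile("hdfs:///tmp/single-file-out")

// or keep multiple part files and merge them afterwards from the command line:
//   hdfs dfs -getmerge /tmp/multi-file-out merged.txt
rdd.repartition(8).saveAsTextFile("hdfs:///tmp/multi-file-out")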
50

The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it.

To print it, you can use foreach (which is an action):

linesWithSessionId.foreach(println)

To write it to disk you can use one of the saveAs... functions (still actions) from the RDD API.
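For instance, a minimal sketch of the transformation/action distinction (the output path is just an example):

val rdd = sc.parallelize(Seq("a", "b", "c"))

rdd.map(println)                    // transformation: returns RDD[Unit], nothing runs yet
rdd.foreach(println)                // action: runs now, printing wherever each task executes
rdd.saveAsTextFile("/tmp/rdd-out")  // action: writes one part file per partition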

– fedragon
  • Maybe you need to mention `collect` so that the RDD can be printed in the console. – zsxwing Apr 20 '14 at 08:30
  • `foreach` itself will first "materialize" the RDD and then run `println` on each element, so `collect` is not really needed here (although you can use it, of course)... – fedragon Apr 20 '14 at 10:10
  • Actually, without collect() before foreach, I'm not able to see anything on the console. – Vittorio Cozzolino May 07 '14 at 13:53
  • On Spark 1.2.0 this does not print the RDD. @Oussama's answer does work, however. – Matthew Cornell Jan 05 '15 at 15:14
  • Actually it works totally fine in my Spark shell, even in 1.2.0. But I think I know where this confusion comes from: the original question asked how to print an RDD to the Spark console (= shell), so I assumed the asker would run a local job, in which case `foreach` works fine. If you are running a job on a cluster and you want to print your RDD, then you should `collect` (as pointed out by other comments and answers) so that it is sent to the driver before `println` is executed. And using `take` as suggested by Oussama might be a good idea if your RDD is too big. – fedragon Jan 07 '15 at 07:49
  • The above answer is bad. You should unaccept it. foreach will not print to the console; it will print on your worker nodes. If you have only one node then foreach will work, but if you have only one node, why are you using Spark? Just use SQL, awk, or grep, or something much simpler. So I think the only valid answer is collect. If collect is too big for you and you only want a sample, use take or head or similar functions as described below. – eshalev Feb 12 '16 at 07:12
16

You can convert your RDD to a DataFrame then show() it.

// For implicit conversion from RDD to DataFrame
import spark.implicits._

val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))

// convert to a DataFrame, then show it
fruits.toDF().show()

This will show the top 20 rows of your data, so the size of your data should not be an issue.

+------+---+                                                                    
|    _1| _2|
+------+---+
| apple|  1|
|banana|  2|
|orange| 17|
+------+---+
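As a follow-up, toDF also accepts column names if you'd rather not see the generated _1/_2 headers:

fruits.toDF("fruit", "count").show()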
– Sam
13

If you're running this on a cluster, println won't print back to your driver. You need to bring the RDD data back to the driver first. To do this you can force it into a local array and then print it out:

linesWithSessionId.toArray().foreach(line => println(line))

Note that toArray() has since been deprecated in favor of collect(), which does the same thing.
– Noah
2
c.take(10)

and newer versions of Spark will display the result as a nicely formatted table.

– Hrvoje
1

There are probably many architectural differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (not only collect, but also other actions). One of the differences I saw is that with myRDD.foreach(println), the output comes back in a random order. For example, if my RDD comes from a text file where each line has a number, the output will be in a different order. But when I did myRDD.collect().foreach(println), the order remained just like the text file.
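A quick way to see this in the shell (a minimal sketch; the numbers and partition count are arbitrary):

// 8 numbers spread over 4 partitions
val nums = sc.parallelize(1 to 8, 4)

nums.foreach(println)            // order depends on which task prints first
nums.collect().foreach(println)  // collect() returns partitions in order: 1, 2, ..., 8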

1

In Python:

   linesWithSessionIdCollect = linesWithSessionId.collect()
   linesWithSessionIdCollect

This will print out all the contents of the RDD.

1

Instead of typing this out each time, you can:

[1] Create a generic print method inside Spark Shell.

def p(rdd: org.apache.spark.rdd.RDD[_]) = rdd.foreach(println)

[2] Or even better, using implicits, you can add the function to the RDD class to print its contents.

implicit class Printer(rdd: org.apache.spark.rdd.RDD[_]) {
    def print = rdd.foreach(println)
}

Example usage:

val rdd = sc.parallelize(List(1,2,3,4)).map(_*2)

p(rdd) // 1
rdd.print // 2

Output:

2
6
4
8

Important

This only makes sense if you are working in local mode and with a small dataset. Otherwise, you will either not be able to see the results on the client or run out of memory because of the large result.
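If you like the implicit-class pattern but want it to stay safe on larger data, a take-based variant (a hypothetical helper, not part of Spark) bounds how much ever reaches the driver:

// only brings at most n elements to the driver before printing
implicit class SamplePrinter(rdd: org.apache.spark.rdd.RDD[_]) {
  def printSample(n: Int = 20): Unit = rdd.take(n).foreach(println)
}

sc.parallelize(1 to 1000000).printSample(5)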

– koders
0

You can also save it as a file: rdd.saveAsTextFile("alicia.txt"). Note that despite the name, this creates a directory called alicia.txt containing one part file per partition.
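A quick round trip in the shell (the path is just an example):

rdd.saveAsTextFile("alicia.txt")                    // writes alicia.txt/part-00000, part-00001, ...
sc.textFile("alicia.txt").take(5).foreach(println)  // read it back and sample a few lines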

– Thomas Decaux
0

In Java:

rdd.collect().forEach(line -> System.out.println(line));
– ForeverLearner