137

I'm attempting to print the contents of a collection to the Spark console.

I have a type:

linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3]

And I use the command:

scala> linesWithSessionId.map(line => println(line))

But this is what gets printed:

res1: org.apache.spark.rdd.RDD[Unit] = MappedRDD[4] at map at :19

How can I write the RDD to console or save it to disk so I can view its contents?

– blue-sky

10 Answers

280

If you want to view the contents of an RDD, one way is to use collect():

myRDD.collect().foreach(println)

That's not a good idea, though, when the RDD has billions of lines. Use take() to take just a few to print out:

myRDD.take(n).foreach(println)
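For example, a minimal spark-shell session showing both (the sample RDD here is just an illustrative stand-in for your data):

// build a small sample RDD (stand-in for your real data)
val myRDD = sc.parallelize(Seq("alpha", "beta", "gamma", "delta"))

// fine for small RDDs: pulls everything to the driver
myRDD.collect().foreach(println)

// safer on large RDDs: only the first n elements reach the driver
myRDD.take(2).foreach(println)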
– Oussama
  • If I use foreach on an RDD (which has millions of lines) to write the contents into HDFS as a single file, will it work without any issues on a cluster? – Shankar Jul 20 '15 at 11:29
  • The reason I am not using `saveAsTextFile` on the RDD is that I need to write the RDD contents into more than one file; that's why I am using `foreach`. – Shankar Jul 20 '15 at 11:34
  • If you want to save to a single file, you can coalesce your RDD into one partition before calling saveAsTextFile, but again this may cause issues. I think the best option is to write multiple files in HDFS and then use hdfs dfs -getmerge to merge them (see the sketch after these comments). – Oussama Jul 21 '15 at 16:10
  • You said that when you use foreach on an RDD it will persist it into the driver's RAM; is that statement correct? Because what I understood is that foreach will run on each worker in the cluster, not on the driver. – Shankar Jul 22 '15 at 06:23
  • saveAsTextFile will write one file per partition, which is what you want (multiple files). Otherwise, as Oussama suggests, you can do rdd.coalesce(1).saveAsTextFile() to get one file. If the RDD has too few partitions for your liking, you can try rdd.repartition(N).saveAsTextFile(). – foghorn Jan 14 '16 at 19:57
  • @Oussama One question: how can we print an `RDD` of type `Array[String]`? When I try the approach above, I just get memory addresses. Is there a `toString()` functionality? – Brian Mar 03 '16 at 02:08
  • @Brian, you can use any function instead of println; you won't have any serialization problems since you use take(n). So you can create your own printing function and use it in place of println. – Oussama Mar 08 '16 at 07:46
  • `take` doesn't seem to work for me. I think it's sorting the results, which is not what I want (for a start, I have a `scala.Tuple2` which does not implement `Comparable`). Is there a way to just take some results as fast as possible without caring which? – mjaggard Aug 16 '16 at 07:33
  • You could also add a link to the official docs: http://spark.apache.org/docs/latest/programming-guide.html#printing-elements-of-an-rdd – dk14 Oct 22 '16 at 06:01
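A sketch of the single-file and multi-file approaches discussed in these comments, assuming rdd is your RDD[String]; the HDFS paths are hypothetical:

// single part file: collapse to one partition first (all data flows through one task)
rdd.coalesce(1).saveAsTextFile("hdfs:///tmp/single-file-out")

// or keep multiple part files and merge them afterwards from the command line:
//   hdfs dfs -getmerge /tmp/multi-file-out merged.txt
rdd.repartition(8).saveAsTextFile("hdfs:///tmp/multi-file-out")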
50

The map function is a transformation, which means that Spark will not actually evaluate your RDD until you run an action on it.

To print it, you can use foreach (which is an action):

linesWithSessionId.foreach(println)

To write it to disk you can use one of the saveAs... functions (still actions) from the RDD API.
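For instance, a minimal sketch of the transformation/action distinction (the output path is just an example):

val rdd = sc.parallelize(Seq("a", "b", "c"))

rdd.map(println)                    // transformation: returns RDD[Unit], nothing runs yet
rdd.foreach(println)                // action: runs now, printing wherever each task executes
rdd.saveAsTextFile("/tmp/rdd-out")  // action: writes one part file per partition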

– fedragon
  • Maybe you need to mention `collect` so that the RDD can be printed in the console. – zsxwing Apr 20 '14 at 08:30
  • `foreach` itself will first "materialize" the RDD and then run `println` on each element, so `collect` is not really needed here (although you can use it, of course)... – fedragon Apr 20 '14 at 10:10
  • Actually, without collect() before foreach, I'm not able to see anything on the console. – Vittorio Cozzolino May 07 '14 at 13:53
  • On Spark 1.2.0 this does not print the RDD. @Oussama's answer does work, however. – Matthew Cornell Jan 05 '15 at 15:14
  • Actually it works totally fine in my Spark shell, even in 1.2.0. But I think I know where this confusion comes from: the original question asked how to print an RDD to the Spark console (= shell), so I assumed the asker would run a local job, in which case `foreach` works fine. If you are running a job on a cluster and you want to print your RDD, then you should `collect` (as pointed out by other comments and answers) so that it is sent to the driver before `println` is executed. And using `take` as suggested by Oussama might be a good idea if your RDD is too big. – fedragon Jan 07 '15 at 07:49
  • The above answer is bad. You should unaccept it. foreach will not print to the console; it will print on your worker nodes. If you have only one node then foreach will work, but if you have only one node, why are you using Spark? Just use SQL, awk, or grep, or something much simpler. So I think the only valid answer is collect. If collect is too big for you and you only want a sample, use take or head or similar functions as described below. – eshalev Feb 12 '16 at 07:12
16

You can convert your RDD to a DataFrame then show() it.

// For implicit conversion from RDD to DataFrame
import spark.implicits._

val fruits = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("orange", 17)))

// convert to a DataFrame, then show it
fruits.toDF().show()

This will show the top 20 rows of your data, so the size of your data should not be an issue.

+------+---+                                                                    
|    _1| _2|
+------+---+
| apple|  1|
|banana|  2|
|orange| 17|
+------+---+
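As a follow-up, toDF also accepts column names if you'd rather not see the generated _1/_2 headers:

fruits.toDF("fruit", "count").show()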
– Sam
13

If you're running this on a cluster, println won't print back to your driver. You need to bring the RDD data back to the driver first. To do this you can force it into a local array and then print it out:

linesWithSessionId.toArray().foreach(line => println(line))

Note that toArray() has since been deprecated in favor of collect(), which does the same thing.
– Noah
2
c.take(10)

and newer versions of Spark will display the result as a nicely formatted table.

– Hrvoje
1

There are probably many architectural differences between myRDD.foreach(println) and myRDD.collect().foreach(println) (not only collect, but also other actions). One of the differences I saw is that with myRDD.foreach(println), the output comes back in a random order. For example, if my RDD comes from a text file where each line has a number, the output will be in a different order. But when I did myRDD.collect().foreach(println), the order remained just like the text file.
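A quick way to see this in the shell (a minimal sketch; the numbers and partition count are arbitrary):

// 8 numbers spread over 4 partitions
val nums = sc.parallelize(1 to 8, 4)

nums.foreach(println)            // order depends on which task prints first
nums.collect().foreach(println)  // collect() returns partitions in order: 1, 2, ..., 8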

1

In Python:

   linesWithSessionIdCollect = linesWithSessionId.collect()
   linesWithSessionIdCollect

This will print out all the contents of the RDD.

1

Instead of typing this out each time, you can:

[1] Create a generic print method inside Spark Shell.

def p(rdd: org.apache.spark.rdd.RDD[_]) = rdd.foreach(println)

[2] Or even better, using implicits, you can add the function to the RDD class to print its contents.

implicit class Printer(rdd: org.apache.spark.rdd.RDD[_]) {
    def print = rdd.foreach(println)
}

Example usage:

val rdd = sc.parallelize(List(1,2,3,4)).map(_*2)

p(rdd) // 1
rdd.print // 2

Output:

2
6
4
8

Important

This only makes sense if you are working in local mode and with a small dataset. Otherwise, you will either not be able to see the results on the client or run out of memory because of the large result.
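If you like the implicit-class pattern but want it to stay safe on larger data, a take-based variant (a hypothetical helper, not part of Spark) bounds how much ever reaches the driver:

// only brings at most n elements to the driver before printing
implicit class SamplePrinter(rdd: org.apache.spark.rdd.RDD[_]) {
  def printSample(n: Int = 20): Unit = rdd.take(n).foreach(println)
}

sc.parallelize(1 to 1000000).printSample(5)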

– koders
0

You can also save it as a file: rdd.saveAsTextFile("alicia.txt"). Note that despite the name, this creates a directory called alicia.txt containing one part file per partition.
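A quick round trip in the shell (the path is just an example):

rdd.saveAsTextFile("alicia.txt")                    // writes alicia.txt/part-00000, part-00001, ...
sc.textFile("alicia.txt").take(5).foreach(println)  // read it back and sample a few lines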

– Thomas Decaux
0

In Java:

rdd.collect().forEach(line -> System.out.println(line));
– ForeverLearner