
I'm reading a book on Apache Spark, and in comparing RDDs and DataFrames it has the following to say:

The key difference between RDD and DataFrame is that DataFrame stores much more information about the data, such as the data types and names of the columns, than RDD. This allows the DataFrame to optimize the processing much more effectively than Spark transformations and Spark actions doing processing on RDD.

However, when playing around with RDDs using Scala, I've noticed that the data type is, in fact, stored. For example:

val acTuplesByAmount = acBalTuples.map{case (amount, accno) => (amount.toDouble, accno)}
acTuplesByAmount.collect()
res5: Array[(Double, String)] = Array((50000.0,SB10001), (12000.0,SB10002), (8500.0,SB10004), (5000.0,SB10005), (3000.0,SB10003))

As you can see, it keeps track of the fact that we wanted a Double and a String. Before my map, I think it probably would have been two Strings.

So is the book wrong? Or do DataFrames still have superior data types somehow?

Stephen

3 Answers


The book is correct. The types you see on an RDD exist only for the Scala compiler and are opaque to the Spark engine. A Dataset, on the other hand, has a schema that defines the type of each column; you can print it with dataset.printSchema(). Those types are visible to the engine, so Spark can, for example, rewrite expressions or push them down to the data source when it recognizes that the optimization will improve performance.
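
For illustration, a minimal sketch of the difference (the SparkSession setup and the sample data are my assumptions, not from the book):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-example").getOrCreate()
import spark.implicits._

// RDD: the (Double, String) type exists only for the Scala compiler;
// to the Spark engine each record is an opaque object.
val rdd = spark.sparkContext
  .parallelize(Seq(("50000", "SB10001"), ("12000", "SB10002")))
  .map { case (amount, accno) => (amount.toDouble, accno) }

// DataFrame: the schema is visible to the engine at runtime.
val df = rdd.toDF("amount", "accno")
df.printSchema()
// root
//  |-- amount: double (nullable = false)
//  |-- accno: string (nullable = true)

// Because Catalyst sees the schema, it can rewrite expressions or push
// filters down to the source; explain() shows the optimized plan.
df.filter($"amount" > 10000).explain()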

T. Gawęda

Indeed, the other answer and the book are correct, but note two things: 1) the SQL approach is possible with a DataFrame, and 2) RDDs allow tuples and less structured data to be processed. They serve different use cases.
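
For example, a small sketch of point 1 (the view name and the data are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-example").getOrCreate()
import spark.implicits._

// Illustrative balances, mirroring the question's data.
val df = Seq((50000.0, "SB10001"), (12000.0, "SB10002"), (8500.0, "SB10004"))
  .toDF("amount", "accno")

// A DataFrame can be queried with SQL via a temporary view --
// there is no direct equivalent for a raw RDD.
df.createOrReplaceTempView("balances")
spark.sql("SELECT accno, amount FROM balances WHERE amount > 10000").show()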

thebluephantom

With a DataFrame, Spark shuffles only the data itself, because every executor already knows the schema. With an RDD, the records are serialized Java objects, which are much more expensive to shuffle and carry all of the information about the data with every record.
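
A sketch of the contrast (the data and the aggregation are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-example").getOrCreate()
import spark.implicits._

val pairs = Seq(("SB10001", 50000.0), ("SB10001", 100.0), ("SB10002", 12000.0))

// RDD shuffle: each record is shipped as a serialized Java object, so
// object and type overhead travels across the network with the data.
val rddSums = spark.sparkContext.parallelize(pairs).reduceByKey(_ + _)

// DataFrame shuffle: executors already know the schema, so only the
// column values move, in Spark's compact internal binary format.
val dfSums = pairs.toDF("accno", "amount").groupBy("accno").sum("amount")
dfSums.explain() // the Exchange node in the plan is the shuffle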

Henrique Goulart