
I'm reading a book on Apache Spark, and in comparing RDDs and DataFrames it has the following to say:

The key difference between RDD and DataFrame is that DataFrame stores much more information about the data, such as the data types and names of the columns, than RDD. This allows the DataFrame to optimize the processing much more effectively than Spark transformations and Spark actions doing processing on RDD.

However, when playing around with RDDs using Scala, I've noticed that the data type is, in fact, stored. For example:

val acTuplesByAmount = acBalTuples.map{case (amount, accno) => (amount.toDouble, accno)}
acTuplesByAmount.collect()
res5: Array[(Double, String)] = Array((50000.0,SB10001), (12000.0,SB10002), (8500.0,SB10004), (5000.0,SB10005), (3000.0,SB10003))

As you can see, it keeps track of the fact that we wanted a Double and a String. Before my map, I think it probably would have been two Strings.

So is the book wrong? Or do DataFrames still have superior data types somehow?

Stephen

3 Answers


The book is correct. The types you see on an RDD exist only for the Scala compiler and are opaque to the Spark engine. A Dataset, on the other hand, has a schema that defines the type of each column; you can print it with dataset.printSchema(). Those types are visible to the engine, so Spark can, for example, rewrite expressions or push them down to the data source when it recognizes that the optimization will improve performance.
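
For illustration, a minimal sketch of the difference (the SparkSession setup and the sample data are my assumptions, not from the book):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-example").getOrCreate()
import spark.implicits._

// RDD: the (Double, String) type exists only for the Scala compiler;
// to the Spark engine each record is an opaque object.
val rdd = spark.sparkContext
  .parallelize(Seq(("50000", "SB10001"), ("12000", "SB10002")))
  .map { case (amount, accno) => (amount.toDouble, accno) }

// DataFrame: the schema is visible to the engine at runtime.
val df = rdd.toDF("amount", "accno")
df.printSchema()
// root
//  |-- amount: double (nullable = false)
//  |-- accno: string (nullable = true)

// Because Catalyst sees the schema, it can rewrite expressions or push
// filters down to the source; explain() shows the optimized plan.
df.filter($"amount" > 10000).explain()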

T. Gawęda

Indeed, the other answer and the book are correct, but note two things: 1) the SQL approach is possible with a DataFrame, and 2) RDDs allow tuples and less structured data to be processed. They serve different use cases.
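
For example, a small sketch of point 1 (the view name and the data are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-example").getOrCreate()
import spark.implicits._

// Illustrative balances, mirroring the question's data.
val df = Seq((50000.0, "SB10001"), (12000.0, "SB10002"), (8500.0, "SB10004"))
  .toDF("amount", "accno")

// A DataFrame can be queried with SQL via a temporary view --
// there is no direct equivalent for a raw RDD.
df.createOrReplaceTempView("balances")
spark.sql("SELECT accno, amount FROM balances WHERE amount > 10000").show()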

thebluephantom

With a DataFrame, Spark shuffles only the data itself, because every executor already knows the schema. With an RDD, the records are serialized Java objects, which are much more expensive to shuffle and carry all of the information about the data with every record.
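
A sketch of the contrast (the data and the aggregation are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-example").getOrCreate()
import spark.implicits._

val pairs = Seq(("SB10001", 50000.0), ("SB10001", 100.0), ("SB10002", 12000.0))

// RDD shuffle: each record is shipped as a serialized Java object, so
// object and type overhead travels across the network with the data.
val rddSums = spark.sparkContext.parallelize(pairs).reduceByKey(_ + _)

// DataFrame shuffle: executors already know the schema, so only the
// column values move, in Spark's compact internal binary format.
val dfSums = pairs.toDF("accno", "amount").groupBy("accno").sum("amount")
dfSums.explain() // the Exchange node in the plan is the shuffle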

Henrique Goulart