I'm reading a book on Apache Spark, and in comparing RDDs and DataFrames it has the following to say:
The key difference between RDD and DataFrame is that DataFrame stores much more information about the data, such as the data types and names of the columns, than RDD. This allows the DataFrame to optimize the processing much more effectively than Spark transformations and Spark actions doing processing on RDD.
However, when playing around with RDDs in Scala, I've noticed that the data type is, in fact, stored. For example:
val acTuplesByAmount = acBalTuples.map{case (amount, accno) => (amount.toDouble, accno)}
acTuplesByAmount.collect()
res5: Array[(Double, String)] = Array((50000.0,SB10001), (12000.0,SB10002), (8500.0,SB10004), (5000.0,SB10005), (3000.0,SB10003))
As you can see, it keeps track of the fact that we wanted a Double and a String. Before my map, I think it probably would have been two Strings.
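For comparison, here is roughly what I understand the DataFrame version of the same data would look like (the column names, sample rows, and local SparkSession below are my own guesses, not from the book):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

// Same kind of account data, but as a DataFrame with named columns
val acBalDF = Seq(("SB10001", "50000"), ("SB10002", "12000")).toDF("accno", "amount")

// Unlike the RDD, the DataFrame carries an explicit schema as runtime metadata:
acBalDF.printSchema()
// root
//  |-- accno: string (nullable = true)
//  |-- amount: string (nullable = true)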
So is the book wrong? Or do DataFrames still have superior data types somehow?