
I had assumed that Spark DataFrames were built from RDDs. However, I recently learned that this is not the case; Difference between DataFrame, Dataset, and RDD in Spark does a good job of explaining that they are not.

So what is the overhead of converting an RDD to a DataFrame, and back again? Is it negligible or significant?

In my application, I create a DataFrame by reading a text file into an RDD and then custom-encoding every line with a map function that returns a Row() object. Should I not be doing this? Is there a more efficient way?
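Roughly, the pattern looks like the following (the file path, delimiter, and column names are placeholders, not the real code):

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the raw text file as an RDD of lines (placeholder path).
lines = spark.sparkContext.textFile("input.txt")

# Custom-encode every line into a Row; the parsing here is illustrative only.
def to_row(line):
    fields = line.split("|")
    return Row(id=int(fields[0]), name=fields[1])

rows = lines.map(to_row)

# Build the DataFrame from the RDD of Rows (the schema is inferred from the Rows).
df = spark.createDataFrame(rows)
```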

vy32
  • Well, you could do `SparkSession.read.text("file")`, but it would still require parsing each line into typed columns. – OneCricketeer Apr 27 '19 at 21:54
  • @cricket_007 yes, I can do that, but is it more efficient to use that, or to go the RDD approach? – vy32 Apr 27 '19 at 22:34
  • I think it depends on the input format. For example, json, avro, parquet, etc have well defined schemas and types... Xml or csv are just read as strings and require some amount of parsing and casting into proper datatypes for a Dataset object to even work. Personally, I prefer starting with Row objects, then later building the Dataset schema when I need it, but I can't think of a case when you'd go back to an RDD – OneCricketeer Apr 27 '19 at 22:39
  • Our output format is pipe-delimited text files. – vy32 Apr 27 '19 at 22:46
  • Then using `spark.read.option("delimiter", "|").csv("file")` as a DataFrame would be preferred. – OneCricketeer Apr 28 '19 at 01:13
  • @cricket_007, so basically, you are saying that the overhead is unknown, so we should not convert from RDD to DataFrame and vice-versa? – vy32 Apr 29 '19 at 00:12
  • I personally don't know it... that doesn't mean it isn't known overall. Of course there's serialization overhead, especially if you are using PySpark. And like I said, I don't know why you'd go "vice versa"; making an RDD is necessary for non-standard data files. Similarly, not all RDDs need to be "relational" in a table format or have SQL operations done on them – OneCricketeer Apr 29 '19 at 02:23

1 Answer


RDDs play a double role in Spark. First, they are the internal data structure that tracks lineage between stages so that failures can be recovered from; second, until Spark 1.3 they were the main interface for users to interact with. Since Spark 1.3, DataFrames have been the main interface, offering much richer functionality than RDDs.

There is no significant overhead when converting a DataFrame to an RDD with df.rdd, since a DataFrame already keeps an instance of its underlying RDD initialized, so returning a reference to that RDD should not have any additional cost. On the other hand, generating a DataFrame from an RDD requires some extra work. There are two ways to convert an RDD to a DataFrame: first by calling rdd.toDF(), and second with spark.createDataFrame(rdd, schema). Both methods evaluate lazily, although there is extra overhead for schema validation and building the execution plan (you can check the toDF() code here for more details). Of course, that overhead would be identical to what you get just by initializing your data with spark.read.text(...), except that the direct read has one step fewer: the conversion from RDD to DataFrame.
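For reference, here is a minimal sketch of the two conversion routes and of the cheap df.rdd access (the sample data and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])

# 1st way: toDF(), optionally passing column names; the schema is inferred.
df1 = rdd.toDF(["id", "value"])

# 2nd way: createDataFrame with an explicit schema.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("value", StringType(), True),
])
df2 = spark.createDataFrame(rdd, schema)

# Going the other way: the DataFrame already holds its underlying RDD,
# so .rdd just gives you back the data as an RDD of Row objects.
rows_rdd = df2.rdd
```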

That extra conversion step is the first reason I would go directly with DataFrames instead of working with two different Spark interfaces.

The second reason is that, when using the RDD interface, you miss out on significant performance features that DataFrames and Datasets offer through the Spark optimizer (Catalyst) and its memory management (Tungsten).
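A quick way to see Catalyst at work is explain(), which prints the plans a DataFrame query will run with (the toy query below is just for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(100).withColumnRenamed("id", "n")

# Catalyst combines the two filters and prunes unused columns;
# explain(True) shows the parsed, analyzed, optimized, and physical plans.
df.filter("n > 10").filter("n < 50").select("n").explain(True)
```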

Finally, I would use the RDD interface only if I need features that are missing from DataFrames, such as key-value pairs, the zipWithIndex function, etc. But even then you can access those via df.rdd, which is costless, as already mentioned. As for your case, I believe it would be faster to use a DataFrame directly and apply the map on that DataFrame, so that Spark leverages Tungsten and ensures efficient memory management.
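Applied to the pipe-delimited files from the question, the DataFrame-only route could look roughly like this (the path and column names are placeholders; the per-line "custom encoding" is expressed as column operations rather than an RDD map):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Read the pipe-delimited text straight into a DataFrame, skipping the RDD step.
df = (spark.read
      .option("delimiter", "|")
      .csv("input.txt"))  # columns come back as _c0, _c1, ...

# Keep the reshaping inside the DataFrame API so Catalyst and Tungsten stay involved.
result = df.select(
    col("_c0").cast("int").alias("id"),
    col("_c1").alias("name"),
)
```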

abiratsis
    "There is no significant overhead when converting one Dataframe to RDD". https://stackoverflow.com/a/37090151/215945 implies there is some (non-trivial) overhead. That answer seems to imply it is more than just returning a reference to the underlying RDD. – Mark Rajcok Jul 17 '19 at 15:41