Spark binary data source vs sc.binaryFiles

Question

Spark 3.0 enables reading binary data using a new data source:

val df = spark.read.format("binaryFile").load("/path/to/data")

With previous Spark versions you could load the data using:

val rdd = sc.binaryFiles("/path/to/data")

Beyond having the option to access binary data through the high-level API (Dataset), are there any additional benefits or features that Spark 3.0 introduces with this data source?

score -1 · Answer 1 · answered Jun 26 '20 at 13:50


I don't think there is any additional benefit, other than that developers get more control over the data with the high-level API (DataFrame/Dataset) than with the low-level one (RDD), and they don't need to worry about performance, since the high-level API optimizes and manages that on its own.

Reference - https://spark.apache.org/docs/3.0.0-preview/sql-data-sources-binaryFile.html
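To make the comparison concrete, here is a minimal sketch of both approaches side by side. It is an illustration rather than a benchmark: the `*.png` glob, the 1 MiB size threshold, the `minPartitions = 8` value, and the object name `BinaryFileComparison` are all assumptions made for the example, and `/path/to/data` is the placeholder path from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this illustration.
object BinaryFileComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("binary-file-comparison")
      .master("local[*]") // local mode, assumed for the example
      .getOrCreate()
    val sc = spark.sparkContext

    // DataFrame API (Spark 3.0+): one row per file, with the columns
    // path, modificationTime, length and content.
    val df = spark.read.format("binaryFile")
      .option("pathGlobFilter", "*.png") // assumed file extension
      .load("/path/to/data")
    df.printSchema()

    // File metadata is exposed as ordinary columns, so it can be filtered
    // and projected like any other DataFrame column.
    df.filter(col("length") < 1024L * 1024L) // assumed 1 MiB threshold
      .select("path", "length")
      .show(false)

    // RDD API: pairs of (path, PortableDataStream). The partition count can
    // be hinted directly through minPartitions (assumed value: 8).
    val rdd = sc.binaryFiles("/path/to/data", minPartitions = 8)
    val sizes = rdd.map { case (path, stream) => (path, stream.toArray().length) }
    sizes.take(5).foreach(println)

    spark.stop()
  }
}
```

With the data source, the file path, modification time and length arrive as columns next to the raw bytes, while `sc.binaryFiles` only hands back the path and a stream, so any metadata has to be derived by hand. On the other hand, `sc.binaryFiles` accepts a `minPartitions` hint directly, which the `binaryFile` reader does not take as an argument.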

P.S. - I don't think my answer qualifies as a formal answer. I originally wanted to post it as a comment, but I couldn't because I have not yet earned the commenting privilege. :)


Shantanu Kher


Thanks for your answer. I would expect formal documentation that discusses the differences, the actual code of the new format, or tests to explain this. – Yosi Dahari Jun 26 '20 at 13:53
I find it pretty confusing that you cannot control parallelism as you could in the previous version. – AlbertoAndreotti Oct 20 '20 at 20:09
