1

Spark 3.0 enables reading binary data using a new data source:

val df = spark.read.format("binaryFile").load("/path/to/data")

With previous Spark versions you could load the data using:

val rdd = sc.binaryFiles("/path/to/data")

Beyond having the option to access binary data through the high-level API (Dataset), are there any additional benefits or features that Spark 3.0 introduces with this feature?

Yosi Dahari

1 Answer

-1

I don't think there is any additional benefit, other than that developers get more control over the data with the high-level API (DataFrame/Dataset) than with the low-level one (RDD), and they don't need to worry about performance, since the high-level API optimizes and manages it on its own.

Reference - https://spark.apache.org/docs/3.0.0-preview/sql-data-sources-binaryFile.html
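To make the comparison concrete, here is a minimal sketch of what the DataFrame route looks like in practice. It assumes Spark 3.0+ on the classpath and a local session; the object name `BinaryFileSketch`, the helper `loadBinaries`, the `*.bin` filter and the temporary files are all hypothetical and only there for illustration. The column names (`path`, `modificationTime`, `length`, `content`) and the `pathGlobFilter` option are the ones described in the linked documentation.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object BinaryFileSketch {

  // Hypothetical helper: read only files matching *.bin under `dir`
  // through the binaryFile data source.
  def loadBinaries(spark: SparkSession, dir: String): DataFrame =
    spark.read
      .format("binaryFile")
      .option("pathGlobFilter", "*.bin") // keep only paths matching the glob
      .load(dir)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("binaryFile-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()

    // Illustrative input: a temp directory with two small files.
    val dir = Files.createTempDirectory("binary-sketch")
    Files.write(dir.resolve("a.bin"), Array[Byte](1, 2, 3))
    Files.write(dir.resolve("b.txt"), "hello".getBytes("UTF-8"))

    val df = loadBinaries(spark, dir.toString)

    // Each file becomes one row: path, modificationTime, length, content.
    df.printSchema()

    // File metadata can be queried like any other DataFrame columns.
    df.select(col("path"), col("length"))
      .where(col("length") > 0)
      .show(truncate = false)

    spark.stop()
  }
}
```

With `sc.binaryFiles` you get an RDD of `(path, PortableDataStream)` pairs instead, so the same filtering and projection would have to be written by hand on the RDD.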

P.S. - I don't think my answer qualifies as a formal answer. I originally wanted to post it as a comment, but could not because I have yet to earn the commenting privilege. :)

Shantanu Kher