I use the spark-xml library from Databricks to parse an XML file (550 MB):
Dataset<Row> books = spark.sqlContext().read()
    .format("com.databricks.spark.xml")
    .option("rootTag", "books")
    .option("rowTag", "book")
    .option("treatEmptyValuesAsNulls", "true")
    .load("path");
Spark parses the file a first time, spawning many tasks/partitions.
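(If that first pass is just spark-xml inferring the schema, I suppose supplying an explicit schema might skip it. A minimal sketch of what I mean, where the single `code` field stands in for my real schema:)

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Assumption: declaring the schema up front spares spark-xml a
// scan of the 550 MB file just to infer it.
StructType schema = new StructType()
    .add("code", DataTypes.StringType);

Dataset<Row> booksWithSchema = spark.read()
    .format("com.databricks.spark.xml")
    .option("rowTag", "book")
    .schema(schema) // provided schema instead of inference
    .load("path");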
Then, when I call this code:
books.select("code").count()
Spark parses the entire file again.
Is there a way to avoid re-parsing the file on every action I call on the Dataset?
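The only idea I have so far is caching the Dataset so the parsed rows are reused across actions; a rough sketch (I am not sure this is the right approach):

import org.apache.spark.storage.StorageLevel;

// Keep the parsed rows around after the first action; spill to
// disk if they do not fit in memory (the file is 550 MB).
books.persist(StorageLevel.MEMORY_AND_DISK());

long codes = books.select("code").count(); // parses the file once, fills the cache
long total = books.count();                // should be served from the cache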