
I use the spark-xml library from Databricks to parse an XML file (550 MB).

Dataset<Row> books = spark.read()
        .format("com.databricks.spark.xml")
        .option("rootTag", "books")
        .option("rowTag", "book")
        .option("treatEmptyValuesAsNulls", "true")
        .load("path");

Spark parses the file a first time, spawning many tasks/partitions.

Then, when I call this code:

books.select("code").count()

Spark parses the whole file again.

Is there a way to avoid re-parsing the file on every action called on the dataset?
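
(For context: the parse at load time is most likely spark-xml inferring a schema by scanning the file. One way to skip that inference pass is to supply an explicit schema before `load`. A minimal sketch; the field names `code` and `title` are assumptions for illustration, not taken from the actual file:)

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Hypothetical schema: adjust the fields to match the real <book> elements.
    StructType bookSchema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("code", DataTypes.StringType, true),
            DataTypes.createStructField("title", DataTypes.StringType, true)
    });

    Dataset<Row> books = spark.read()
            .format("com.databricks.spark.xml")
            .option("rootTag", "books")
            .option("rowTag", "book")
            .option("treatEmptyValuesAsNulls", "true")
            .schema(bookSchema)  // skip the schema-inference scan over the file
            .load("path");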

robynico
  • How do you know Spark is parsing it twice? If there is only one action involved, Spark will parse it only once, unless there is something like a self-join before the action. You can always cache/persist the dataset if you are forcing multiple actions on the same dataset and/or deriving multiple datasets from it. – philantrovert Mar 14 '18 at 12:13
  • Because Spark logs this message several times: INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 20, localhost, executor driver, partition 0, PROCESS_LOCAL, 7907 bytes) INFO Executor: Running task 0.0 in stage 2.0 (TID 20) INFO NewHadoopRDD: Input split: file:/C:/books.xml:0+33554432 – robynico Mar 14 '18 at 13:14
  • Do you do anything else in addition to the `count`? You could add `.cache` after loading the dataset to avoid reading it more than once. – Shaido Mar 15 '18 at 01:28
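
As the comments suggest, caching the dataset keeps the parsed rows in memory (spilling to disk if needed, the default storage level for Datasets), so subsequent actions reuse them instead of re-reading the 550 MB file. A minimal sketch of that approach, reusing the `books` dataset from the question:

    // Mark the dataset for caching; the XML is parsed once to fill the cache,
    // then every later action reads from memory instead of the file.
    books.cache();

    long totalBooks = books.count();                // triggers the parse and populates the cache
    long codeCount = books.select("code").count();  // served from the cached data

    books.unpersist();  // release the cached blocks once they are no longer needed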

0 Answers