
I am working on a Spark program in which I have to load Avro data and process it. I am trying to understand how job IDs are created for a Spark application. I use the line of code below to load the Avro data.

sqlContext.read.format("com.databricks.spark.avro").load(path)

As far as I know, job IDs are created based on the actions encountered in the program. My job is scheduled to run every 30 minutes. When I look at the Spark history server for this application, I see that a job ID is created for the load operation, but it happens only sometimes, and the logs look absolutely fine. I am using Spark 1.6.1.
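
To illustrate what I mean, here is a minimal sketch of my understanding (the column name is just a placeholder):

// Transformations are lazy; only actions should cut a job.
val avroDf = sqlContext.read.format("com.databricks.spark.avro").load(path)
val filtered = avroDf.filter(avroDf("someColumn").isNotNull) // lazy, no job expected
filtered.count() // action: this is where I expect a job ID to appear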

I am curious to know: does the load operation create a new job ID in an application?

srujana

1 Answer


In general, data loading operations in Spark SQL are not lazy unless you provide a schema to the DataFrameReader. Depending on the source, the scope and impact can vary from a simple metadata access to a full data scan.

In this particular case it is pretty much limited to a file-system scan and a single file access to read the schema.
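
If you want to rule out that schema-reading job, you can provide the schema up front so load has nothing eager to do. A minimal sketch against the 1.6 API; the field names and types are placeholders for your actual Avro schema, and whether the Avro connector skips inference entirely may depend on its version:

import org.apache.spark.sql.types._

// Placeholder schema; replace with the actual fields of your Avro files.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// With an explicit schema there is no inference step,
// so load itself should not trigger a job.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .schema(schema)
  .load(path)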

zero323
  • In my application, there are two load operations: one to load Parquet data and the other to load Avro data. A job ID is created for the Parquet load, but job ID creation doesn't look consistent for the Avro load; sometimes I don't see a job ID for it at all. I am wondering why this is happening. If you have an idea, can you elaborate on this behavior? – srujana Jul 18 '16 at 18:16
  • I tried to debug this to understand why it is inconsistent, and also added a persist to the load step that returns the DataFrame. Even then, the jobs in the Spark UI are inconsistent. – srujana Jul 18 '16 at 18:19