
While reading any CSV, Spark always converts the load into 3 stages, whether the file is small, large, or contains only a header row. There are always three jobs, each with one stage. My application has no transformations or actions; it only loads the CSV.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WordCount {

    public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession.builder()
                .appName("Java Spark Application")
                .master("local")
                .getOrCreate();
        Dataset<Row> df = spark.read()
                .format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load("/home/ist/OtherCsv/EmptyCSV.csv");
        spark.close();
    }
}

Spark UI images:

  1. three jobs in the Spark UI
  2. stage-related info
  3. all three stages have the same DAG visualization
  4. all three jobs have the same DAG visualization
  5. the event timeline

Questions:

  1. Why is loading or reading a CSV always split into exactly three stages and three jobs?
  2. Why are three jobs created when there is no action?
  3. How are stages formed at the code level?
  • Go through https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html and it should be clear – Ramesh Maharjan Apr 03 '18 at 14:58
  • Question: how many stages do you expect? And how would you find out about the implementation of `read.csv`? – ernest_k Apr 03 '18 at 14:58
  • Spark is evolving fast, so your observations are not the same for Spark 2.3.0, in which I see 2 jobs with identical stages. Also, I find that a json load does not trigger any jobs (as would be expected without an action). So these are best left as individual datasource implementation details, unless you really want to get into the code. My answer to a similar [question](https://stackoverflow.com/questions/49385724/how-to-know-the-number-of-spark-jobs-and-stages-in-broadcast-join-query) discusses why stages may show as identical in the spark-ui – sujit Apr 04 '18 at 10:24
  • I found out why there are three jobs for the CSV read: it happens because of the options I set in the code, inferSchema and header, both set to true. If I set inferSchema and header to false, there is only one job (a sketch reproducing this comparison follows these comments). – Pratibha Baghare Apr 05 '18 at 07:59
  • About the answer on why the stages are identical: I don't understand the statement that the first Spark job scans the first partition and, because it did not have enough rows, leads to another Spark job that scans more partitions. What does "it had not enough rows" mean here? – Pratibha Baghare Apr 05 '18 at 08:11
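
Based on the finding in the comments above, here is a minimal sketch that reads the same file twice, once with header and inferSchema enabled and once with both disabled, so the difference in job counts can be compared in the Spark UI. The class name is illustrative, and the exact job counts are an observation that may vary across Spark versions.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvJobComparison {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CSV Job Comparison")
                .master("local")
                .getOrCreate();

        // header + inferSchema: Spark may run extra jobs up front, e.g. to
        // read the header line and to scan the data for type inference.
        Dataset<Row> inferred = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/home/ist/OtherCsv/EmptyCSV.csv");

        // Both options off: every column is typed as string and no upfront
        // scan is needed, so fewer jobs appear in the Spark UI.
        Dataset<Row> plain = spark.read()
                .option("header", "false")
                .option("inferSchema", "false")
                .csv("/home/ist/OtherCsv/EmptyCSV.csv");

        spark.close();
    }
}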

1 Answer


By default, reading csv, json, or parquet will create 2 jobs, but if we enable inferSchema for a CSV file it will create 3 jobs.
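
If the goal is to avoid the extra schema-inference job entirely, a common alternative is to declare the schema up front instead of enabling inferSchema. A minimal sketch, assuming hypothetical column names (word, count) since the question does not show the file's contents:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ExplicitSchemaRead {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Explicit Schema Read")
                .master("local")
                .getOrCreate();

        // Supplying the schema explicitly means Spark does not have to scan
        // the file to infer column types, so no inference job is triggered.
        StructType schema = new StructType()
                .add("word", DataTypes.StringType)    // hypothetical column
                .add("count", DataTypes.IntegerType); // hypothetical column

        Dataset<Row> df = spark.read()
                .schema(schema)
                .option("header", "true")
                .csv("/home/ist/OtherCsv/EmptyCSV.csv");

        spark.close();
    }
}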