
I've run into a weird situation where the simplest possible Spark application seemingly completed the same job twice.

What I've done

The application itself executes the query:

SELECT date, field1, field2, ..., field10
FROM table1
WHERE field1 = <some number>
  AND date BETWEEN date('2018-05-01') AND date('2018-05-30')
  ORDER BY 1

and stores the results into HDFS.
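
For context, here is roughly what the application looks like. This is only a minimal sketch: table1 is assumed to be registered in the Hive metastore, the application name and output path are placeholders, the column list is abbreviated, and the filter value is the one visible in the physical plan at the bottom of this question.

import org.apache.spark.sql.SparkSession

// Minimal sketch of the job: run the query and write the (tiny) result to HDFS.
object ExportJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("export-job")                 // placeholder name
      .enableHiveSupport()                   // table1 assumed to live in the Hive metastore
      .getOrCreate()

    val result = spark.sql(
      """SELECT date, field1, field2 -- ..., field10 (column list abbreviated)
        |FROM table1
        |WHERE field1 = 1234567890
        |  AND date BETWEEN date('2018-05-01') AND date('2018-05-30')
        |ORDER BY 1""".stripMargin)

    // The physical plan below shows a Coalesce(16) and a CSV write of the result.
    result.coalesce(16)
      .write
      .mode("overwrite")
      .csv("hdfs://hadoop/root/tmp/output")  // placeholder output path

    spark.stop()
  }
}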

Table table1 is a bunch of parquet files stored on HDFS and partitioned as follows

/root/date=2018-05-01/hour=0/data-1.snappy.parquet
/root/date=2018-05-01/hour=0/data-2.snappy.parquet
...
/root/date=2018-05-01/hour=1/data-1.snappy.parquet
...
/root/date=2018-05-02/hour=0/data-1.snappy.parquet
...
etc.

All Parquet files are between 700 MB and 2 GB in size and share the same schema: 10 non-null fields of int or bigint type.

The result of the application is tiny in size -- a couple of thousand rows only.

My Spark application ran on YARN in cluster mode. The base Spark parameters were:

spark.driver.memory=2g
spark.executor.memory=4g
spark.executor.cores=4
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true
spark.submit.deployMode=cluster

During execution a couple of containers were preempted, but no errors and no failures occurred. The whole application completed in one attempt.

The weird thing

Screenshots from Spark UI:

  • main screen
  • stage 2
  • stage 4

As can be seen, stages 2 and 4 both processed the same number of input rows, but stage 4 also did some shuffling (those were the result rows). The failed tasks are the ones whose containers were preempted.

So it looks like my application processed the same files twice.

I have no clue how that is possible or what happened. Please help me understand why Spark is doing such a weird thing.

Actual physical plan:

== Physical Plan ==
Execute InsertIntoHadoopFsRelationCommand InsertIntoHadoopFsRelationCommand hdfs://hadoop/root/tmp/1530123240802-PrQXaOjPoDqCBhfadgrXBiTtfvFrQRlB, false, CSV, Map(path -> /root/tmp/1530123240802-PrQXaOjPoDqCBhfadgrXBiTtfvFrQRlB), Overwrite, [date#10, field1#1L, field0#0L, field3#3L, field2#2L, field5#5, field4#4, field6#6L, field7#7]
+- Coalesce 16
   +- *(2) Sort [date#10 ASC NULLS FIRST], true, 0
      +- Exchange rangepartitioning(date#10 ASC NULLS FIRST, 200)
         +- *(1) Project [date#10, field1#1L, field0#0L, field3#3L, field2#2L, field5#5, field4#4, field6#6L, field7#7]
            +- *(1) Filter (isnotnull(field1#1L) && (field1#1L = 1234567890))
               +- *(1) FileScan parquet default.table1[field0#0L,field1#1L,field2#2L,field3#3L,field4#4,field5#5,field6#6L,field7#7,date#10,hour#11] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://hadoop/table1], PartitionCount: 714, PartitionFilters: [(date#10 >= 17652), (date#10 <= 17682)], PushedFilters: [IsNotNull(field1), EqualTo(field1,1234567890)], ReadSchema: struct<field0:bigint,field1:bigint,field2:bigint,field3:bigint,field4:int,field5:int,field6:bigint,field7:...

Here are DAGs for Stages 2 and 4:

  • stage 2
  • stage 4
zavyrylin
  • could you include the plan? (using `df.explain()`) – Raphael Roth Jun 28 '18 at 16:25
  • Sure. Added to the bottom of my lengthy question. – zavyrylin Jun 28 '18 at 18:35
  • Can you share the DAG graph? – vaquar khan Jun 28 '18 at 19:23
  • Attached a couple of pictures. Is it possible to get the whole application's DAG containing all the stages in a single picture? – zavyrylin Jun 28 '18 at 19:39
  • @MaxZavyrylin Spark starts a job to perform schema inference, but only on a sample of the data. Just to eliminate that as a cause, can we run the job with `spark.sql.hive.caseSensitiveInferenceMode=false`? (see the sketch after this comment thread) – DaRkMaN May 26 '19 at 05:12
  • @DaRkMaN 1) We store the schema in the Hive metastore, so there's no need to infer it (I guess :). 2) Stages 2 and 4 both report the same number of processed records and take roughly the same amount of time, so it doesn't look like sampling. – zavyrylin May 27 '19 at 06:24
  • @MaxZavyrylin Though the schema is stored in the Hive metastore, Spark does schema inference to deal with case-sensitive schemas, and the result gets stored in `TBLPROPERTIES` of the Hive table. But like you said, since the statistics are similar, it doesn't look like sampling. Also, I can observe that it is not the complete query but only the file scan that executes twice. – DaRkMaN May 27 '19 at 07:09
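
For anyone who wants to rule out the schema-inference hypothesis from the comments: as far as I know, the documented values of `spark.sql.hive.caseSensitiveInferenceMode` are INFER_AND_SAVE, INFER_ONLY and NEVER_INFER rather than a boolean, so a minimal sketch (placeholder application name) would look like this:

import org.apache.spark.sql.SparkSession

// Sketch: disable case-sensitive schema inference to rule it out as the source
// of the extra job; valid values are INFER_AND_SAVE, INFER_ONLY, NEVER_INFER.
val spark = SparkSession.builder()
  .appName("no-schema-inference")                                      // placeholder name
  .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
  .enableHiveSupport()
  .getOrCreate()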

2 Answers


I ran into this exact same issue, and it turns out this behavior is entirely normal.

I observed this behavior in a Spark job that simply reads from HDFS, does some lightweight processing, and uses the orderBy method to sort on a column before writing back to HDFS. In the Spark UI, I saw two jobs that would scan the entire 6 TB table, just as you did. The first job used very little memory, wrote no shuffle records, and wrote no records to HDFS.

It turns out that the root cause is that, before actually sorting the data, Spark runs a sampling operation to build the RangePartitioner it uses to partition the data for its sorting algorithm: it needs to know the approximate range of values in the sort-key column in order to define a good RangePartitioner.
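
Here is a minimal toy sketch (not the original job, and using the RDD API rather than the DataFrame API) that makes the extra pass visible in the Spark UI: merely constructing the RangePartitioner launches a sampling job over the input, and the actual sort is then a second pass over the same data.

import org.apache.spark.RangePartitioner
import org.apache.spark.sql.SparkSession

// Toy illustration of the two passes behind a range-partitioned sort.
object SamplingPassDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sampling-pass-demo").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for (sort key, payload) pairs read from the table.
    val pairs = sc.parallelize(1 to 1000000, 8).map(i => (i % 30, i.toLong))

    // Pass 1: constructing the partitioner runs a sampling job to estimate key ranges
    // (it shows up as its own job in the UI).
    val partitioner = new RangePartitioner(16, pairs)

    // Pass 2: the real shuffle plus per-partition sort reads the data again.
    val sorted = pairs.repartitionAndSortWithinPartitions(partitioner)
    println(sorted.count())

    spark.stop()
  }
}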

This operation is mentioned on this blog:

https://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/

this StackOverflow post:

How does Spark achieve sort order?

and also in the great book, "High Performance Spark," by Holden Karau and Rachel Warren, on p. 143.

In my case, I know the range of the key, so it occurred to me that I should in principle be able to define a RangePartitioner a priori. However, when I dug around in the Spark source code for its sort method, I didn't find any way to pass a range explicitly.
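
For what it's worth, if you can drop down to the RDD API you can supply a partitioner with a-priori bounds yourself and skip the sampling pass. The following is only a rough sketch of that idea, with a made-up key type, bounds and partition count; it is not something the Dataset orderBy/sort API exposes.

import org.apache.spark.Partitioner

// Hypothetical partitioner for integer keys known in advance to lie in [minKey, maxKey).
// Because the mapping from key to partition is monotonic, sorting within each
// partition yields a globally sorted result without any sampling job.
class KnownRangePartitioner(numParts: Int, minKey: Int, maxKey: Int) extends Partitioner {
  require(numParts > 0 && maxKey > minKey)
  private val span = (maxKey - minKey).toDouble

  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    val idx = (((k - minKey) / span) * numParts).toInt
    math.min(math.max(idx, 0), numParts - 1)   // clamp keys outside the assumed range
  }
}

// Usage sketch on a pair RDD keyed by the sort column:
//   val sorted = pairs.repartitionAndSortWithinPartitions(new KnownRangePartitioner(16, 0, 30))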

dchristle
  • The above DAG shows that the data was completely read again instead of sampling a percentage of the data. Is this expected? Also, I think the sampling happens in a separate job, not a separate stage. Please correct me if I am wrong. – DaRkMaN May 26 '19 at 05:09
  • @DaRkMaN That's correct and expected: RangePartitioner implements reservoir sampling via the `sketch` method in `Partitioner.scala`. In order to generate a truly random sample, all elements must be read. Imagine you had a single column of 100 integers from 1 to 100, already sorted in ascending order, that you wanted to sort, and you only read the first 10% of the data. You'd only read some of the smallest values (1-11), so your estimated "range" wouldn't be right (see the toy sketch after this comment thread). – dchristle May 27 '19 at 22:31
  • But as far as I know, Parquet files carry statistics that record the largest value per column and row group without having to scan the data. I've seen those stats used for dictionary filtering. Does Spark not use these stats from the Parquet files to get its range, and instead scans the whole data pages? – Eugenio.Gastelum96 Jul 18 '23 at 19:07
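
As a toy illustration of the reservoir-sampling point in the comments above (this is not Spark's actual `sketch` implementation): every element is visited exactly once, and each one ends up in the sample with equal probability, which a prefix scan of pre-sorted input cannot provide.

import scala.reflect.ClassTag
import scala.util.Random

// Toy reservoir sampler: after seeing n items, each one is in the reservoir
// with probability k / n, which requires visiting every element once.
object ReservoirDemo {
  def reservoirSample[T: ClassTag](items: Iterator[T], k: Int, rng: Random = new Random()): Array[T] = {
    val reservoir = scala.collection.mutable.ArrayBuffer.empty[T]
    var seen = 0L
    while (items.hasNext) {
      val item = items.next()
      seen += 1
      if (reservoir.size < k) {
        reservoir += item                     // fill the reservoir first
      } else {
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = item  // replace with probability k / seen
      }
    }
    reservoir.toArray
  }

  def main(args: Array[String]): Unit = {
    // Values drawn from all of 1..100, not just the smallest prefix.
    println(reservoirSample(Iterator.range(1, 101), 10).sorted.mkString(", "))
  }
}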

I'm still not sure why Spark behaves that way and I'm still digging, but I've managed to work out what happens.

Note: my SQL ends with ORDER BY. Since the job was expected to return very few rows, I thought the final sorting would be an easy-peasy task.

So, when I remove the ORDER BY clause, my query runs as expected and reads the Parquet files only once. This weird behaviour is reproducible no matter how big the dataset is or how many times tasks are preempted during execution: adding the ORDER BY clause causes Spark to scan the entire dataset twice (or at least it looks that way).

I forgot to mention: I'm using Spark 2.3.0 from Hortonworks' distribution (HDP-2.6.5).
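
Since the final result is only a couple of thousand rows, one alternative (just a rough sketch of an idea, with placeholder names and paths, not a general recommendation) is to drop the ORDER BY, coalesce to a single partition and sort within it. That keeps the output ordered while avoiding the range-partitioning exchange and its sampling pass, so the input is scanned only once.

import org.apache.spark.sql.SparkSession

// Sketch: same query without ORDER BY, then coalesce(1) + sortWithinPartitions.
// No RangePartitioner is needed; only reasonable because the result is tiny.
object ExportWithoutRangeSort {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("export-no-range-sort")       // placeholder name
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.sql(
      """SELECT date, field1, field2 -- ..., field10 (column list abbreviated)
        |FROM table1
        |WHERE field1 = 1234567890
        |  AND date BETWEEN date('2018-05-01') AND date('2018-05-30')""".stripMargin)

    df.coalesce(1)
      .sortWithinPartitions("date")
      .write
      .mode("overwrite")
      .csv("hdfs://hadoop/root/tmp/output")  // placeholder output path

    spark.stop()
  }
}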

zavyrylin