
I am trying to understand the query plans generated by Spark SQL (2.4). I have the following query and its corresponding query plan (below). (The query is just a test query.)

create temporary view tab_in as
select distinct 
       mth_id 
from tgt_tbl;

select /*+ BROADCAST(c) */
a.mth_id,
a.qtr_id,
a.prod_id,
a.sale_date
from my_table a
left anti join tab_in c
on a.mth_id = c.mth_id;

Explain plan:

+- *(3) Project [mth_id#652, qtr_id#653, prod_id#655, sale_dt#656]
   +- *(3) BroadcastHashJoin [mth_id#652], [mth_id#867], LeftAnti, BuildRight
      :- *(3) Project [mth_id#652, qtr_id#653, sale_dt#656, prod_id#655]
      :  +- *(3) FileScan parquet test_db.my_table[mth_id#652, qtr_id#653, prod_id#655, sale_dt#656] Batched: true, Format: Parquet, Location: CatalogFileIndex[s3://test-data/my_table/0], PartitionCount: 1, PartitionFilters: [], PushedFilters: [], ReadSchema: ......
      +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
         +- *(2) HashAggregate(keys=[mth_id#867], functions=[], output=[mth_id#867])
            +- Exchange hashpartitioning(mth_id#867, 200)
               +- *(1) HashAggregate(keys=[mth_id#867], functions=[], output=[mth_id#867])
                  +- *(1) Project [mth_id#867]
                     +- *(1) Filter isnotnull(mth_id#867)
                         +- *(1) FileScan parquet test_db.my_table[mth_id#867] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://test-data/tgt_tbl/20200609], PartitionFilters: [], PushedFilters: [IsNotNull(mth_id)], ReadSchema: struct<mth_id:int>

As can be seen above, there are two HashAggregates in the plan - one before and one after the Exchange hashpartitioning. I figured that the first HashAggregate is probably due to the DISTINCT clause in the first query, but I cannot work out the reason for the second HashAggregate (the one after the Exchange).

I have tried combining the two queries into one by putting the first query in a WITH (CTE) clause, but I still got the same plan.

Can someone please explain the need for the second (reading from the bottom) HashAggregate?

Any help is appreciated. Thanks

marie20

1 Answer


Both HashAggregates in the plan come from the deduplication (DISTINCT). HashAggregate usually comes in a pair. The first one performs a partial deduplication locally on each executor, which reduces the amount of data that has to be shuffled. Then comes the Exchange: the data is shuffled so that all rows with the same key end up in the same partition. The second HashAggregate is then responsible for the final, global deduplication after the shuffle.
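The two-phase pattern can be sketched in plain Python (this is a conceptual illustration, not Spark code; the function names and the three-partition layout are made up for the example):

```python
def partial_dedup(partition):
    # Phase 1 (the first HashAggregate): each task deduplicates only
    # the rows it holds locally, shrinking what must be shuffled.
    return set(partition)

def shuffle(partials, n_out):
    # Exchange hashpartitioning: every key is routed by hash, so all
    # surviving copies of a key land in the same output partition.
    out = [[] for _ in range(n_out)]
    for part in partials:
        for key in part:
            out[hash(key) % n_out].append(key)
    return out

def final_dedup(partition):
    # Phase 2 (the second HashAggregate): all remaining copies of a key
    # are now co-located, so this deduplication is global and final.
    return set(partition)

# Three input partitions with overlapping month ids.
partitions = [[202001, 202002, 202001], [202002, 202003], [202003, 202001]]
partials = [partial_dedup(p) for p in partitions]
shuffled = shuffle(partials, 2)
result = sorted(k for p in shuffled for k in final_dedup(p))
print(result)  # [202001, 202002, 202003]
```

Note that the local pass alone cannot finish the job: the same key may survive in several partitions (202001 survives in two partials here), so the shuffle plus a second aggregate is what makes the result globally distinct.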

David Vrba
  • thanks @David.. it makes sense.. Is this true for other aggregations too (e.g. Group By etc) ? Btw, I have watched your videos on Spark Physical plans. I am interested in learning Spark Sql Query Plans more.. I'd appreciate if you refer me to other online resources regarding tuning Spark queries. – marie20 Jun 17 '20 at 06:28
  • 1
    @marie20 Yes, it works the same way for `groupBy` transformations as well. Regarding the query plans, there are not many resources out there. You can check my recent article about using `repartition` to optimize queries in Spark SQL: https://towardsdatascience.com/should-i-repartition-836f7842298c – David Vrba Jun 17 '20 at 19:28
  • thanks a lot @David.. one final question - we use plain sql queries (example in my question above) but not `PySpark-SQL`, hence would like to know what is the usage equivalent syntax of `repartition` in plain `select` queries ? Would be great if you could pls provide the usage syntax with a simple `select` or `Join` query example. Appreciate your help.. – marie20 Jun 17 '20 at 21:58
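The point in the comments that GROUP BY aggregations work the same way can be sketched with the same plain-Python model (again illustrative only, not Spark code): for a `count(*)` each task first computes partial counts, the shuffle co-locates the partials per key, and a final merge sums them.

```python
from collections import Counter

def partial_count(partition):
    # Partial aggregation (first HashAggregate): per-task counts.
    return Counter(partition)

def merge_counts(partials, n_out):
    # Exchange + final HashAggregate: partial counts for the same key
    # are routed to the same partition and summed there.
    out = [Counter() for _ in range(n_out)]
    for part in partials:
        for key, cnt in part.items():
            out[hash(key) % n_out][key] += cnt
    return out

partitions = [[202001, 202002, 202001], [202002, 202003], [202003, 202001]]
partials = [partial_count(p) for p in partitions]
merged = merge_counts(partials, 2)
totals = {k: v for c in merged for k, v in c.items()}
print(totals)  # {202001: 3, 202002: 2, 202003: 2} (key order may vary)
```

This is why a partial aggregate is worth doing even though a second one must follow: only the small per-key partials cross the network, not every input row.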