
I have a large fact table, roughly 500M rows per day. The table is partitioned by region_date.

Every day I have to scan through 6 months of data, left outer join it with a much smaller table (about 1M rows) on an id and a date column, and calculate two aggregate values per date: sum(fact) for the rows whose id exists in the right table, and the overall sum(fact).

My SparkSQL looks like this:

SELECT
    a.region_date,
    SUM(case
          when t4.id is null then 0
          else a.duration_secs
        end) matching_duration_secs,
    SUM(a.duration_secs) total_duration_secs
 FROM fact_table a LEFT OUTER JOIN id_lookup t4
       ON a.id = t4.id
      and a.region_date = t4.region_date
 WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
   AND a.is_test = 0
   AND a.desc = 'VIDEO'
GROUP BY a.region_date

What is the best way to optimize and distribute/partition the data? The query currently runs for more than 3 hours. I have tried spark.sql.shuffle.partitions = 700.

If I roll up the daily data to the "id" level, it comes to about 5M rows per day. Should I roll up the data first and then do the join?
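
For reference, the rolled-up version I have in mind would look roughly like this (assuming id_lookup has at most one row per (id, region_date), so the pre-aggregation does not change the result):

WITH daily_rollup AS (
  SELECT
      region_date,
      id,
      SUM(duration_secs) AS duration_secs
   FROM fact_table
  WHERE region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
    AND is_test = 0
    AND desc = 'VIDEO'
  GROUP BY region_date, id
)
SELECT
    r.region_date,
    SUM(case when t4.id is null then 0 else r.duration_secs end) matching_duration_secs,
    SUM(r.duration_secs) total_duration_secs
 FROM daily_rollup r LEFT OUTER JOIN id_lookup t4
       ON r.id = t4.id
      AND r.region_date = t4.region_date
GROUP BY r.region_date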

Thanks,

Ram.


1 Answer


Because there are some filter conditions in your query, I think you can split it into two queries to decrease the amount of data first.

       -- register the filtered subset as a temporary view (Spark 2.x syntax)
       CREATE TEMPORARY VIEW table1 AS
       SELECT *
         FROM fact_table a
        WHERE a.region_date >= CAST(date_format(DATE_ADD(CURRENT_DATE,-180), 'yyyyMMdd') AS BIGINT)
          AND a.is_test = 0
          AND a.desc = 'VIDEO'

Then you can use this new table, which is much smaller than the original one, to join with the id_lookup table.
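
For example, the second query against that filtered table could look roughly like this (a sketch that reuses the column names from the question and assumes table1 was registered as above):

       SELECT t1.region_date,
              SUM(CASE WHEN t4.id IS NULL THEN 0 ELSE t1.duration_secs END) AS matching_duration_secs,
              SUM(t1.duration_secs) AS total_duration_secs
         FROM table1 t1
         LEFT OUTER JOIN id_lookup t4
           ON t1.id = t4.id
          AND t1.region_date = t4.region_date
        GROUP BY t1.region_date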

Yang Bryan
  • Thanks. My main question is: is there a way to optimize performance by partitioning by region_date inside Spark, or will Spark do it automatically? – Ram Nov 10 '16 at 08:11
  • I don't think spark.sql.shuffle.partitions will increase performance. But because you select data by region_date, partitioning the data by region_date will increase performance. – Yang Bryan Nov 11 '16 at 03:31