I am getting an error in my PySpark code when I run the query below through spark.sql, even though the same query works when I run it directly in BigQuery.
df = spark.sql('''
    SELECT h.src, AVG(h.norm_los) AS mbr_avg
    FROM UM h
    WHERE h.orig_dt < date_sub(current_date, INTERVAL 30 DAY)
      AND CAST((DATE_ADD(DATE(h.orig_dt), INTERVAL 18 MONTH)) AS DATE) >= date_sub(current_date, INTERVAL 30 DAY)
      AND h.src IS NOT NULL
    GROUP BY h.src_cumb_id
''')
When I run the same query with PySpark on a Dataproc cluster, Spark converts INTERVAL 30 DAY into interval 4 weeks 2 days and I get the error below:
cannot resolve 'date_sub(current_date(), interval 4 weeks 2 days)' due to data type mismatch: argument 2 requires int type, however, 'interval 4 weeks 2 days' is of calendarinterval type
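From the error it looks like Spark SQL's date_sub(start_date, num_days) expects an integer number of days rather than a calendar interval. Purely as a point of comparison (not a confirmed fix), a Spark-native sketch of the same filter might look like the following; using add_months for the 18-month shift and grouping by h.src so the projection is valid are my assumptions, not part of the original query:

df = spark.sql('''
    SELECT h.src, AVG(h.norm_los) AS mbr_avg
    FROM UM h
    WHERE h.orig_dt < date_sub(current_date(), 30)                          -- integer day count instead of INTERVAL 30 DAY
      AND add_months(DATE(h.orig_dt), 18) >= date_sub(current_date(), 30)   -- add_months instead of DATE_ADD ... INTERVAL 18 MONTH
      AND h.src IS NOT NULL
    GROUP BY h.src                                                          -- original groups by h.src_cumb_id
''')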
Note: I am using option("intermediateFormat","orc") as well.
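For context, that option belongs to the spark-bigquery connector's write path; a rough sketch of where it sits (the bucket and table names below are placeholders, not my actual values):

df.write.format("bigquery") \
    .option("intermediateFormat", "orc") \
    .option("temporaryGcsBucket", "some-temp-bucket") \
    .save("some_project.some_dataset.some_table")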
Any help on this would be appreciated.