I have a Spark job that reads from a SQL Server database using the JDBC connector.
The table is indexed on the insertion_time and car_id columns.
At first, the query I used to read the table was:
SELECT car_id, km_traveled, fuel_used, manufacturing_date, insertion_time
FROM car_info
WHERE insertion_time > '2022-01-01 00:00:01' AND km_traveled > 1000
This query timed out because the table is not indexed on the km_traveled column.
So I changed the query I use to read from the table to:
SELECT car_id, km_traveled, fuel_used, manufacturing_date, insertion_time
FROM car_info
WHERE insertion_time > '2022-01-01 00:00:01'
Then, immediately after the data is read, I apply a filter of km_traveled > 1000 in the Spark job itself, as opposed to in the read query.
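Roughly, the read and the subsequent filter look like this (a PySpark sketch; the connection details are placeholders for my real settings):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("car_info_reader").getOrCreate()

# Placeholder connection details.
jdbc_url = "jdbc:sqlserver://<host>:1433;databaseName=<db>"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query",
            "SELECT car_id, km_traveled, fuel_used, manufacturing_date, insertion_time "
            "FROM car_info "
            "WHERE insertion_time > '2022-01-01 00:00:01'")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# The filter applied right after the read, inside the Spark job rather than in the read query above.
df = df.filter(F.col("km_traveled") > 1000)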
Spark sees this filter and pushes it down into the read query, adding WHERE km_traveled > 1000 to the sub-query it sends to the DB.
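This push-down is visible in the physical plan, where the condition shows up on the JDBC scan node:

# The pushed-down condition appears under PushedFilters on the JDBC scan node.
df.explain()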
However, since the table is not indexed on the km_traveled column and has 1,500,000,000 rows, I believe this optimization is actually slowing the query down to the point where it still hits a timeout that aborts it.
My question is: is there a way to disable this optimization?
I saw that optimizer rules can in general be disabled here, but I could not find the specific rule responsible for the optimization described above.
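For reference, the mechanism I mean looks roughly like this (just a sketch; ConstantFolding is only an example of the rule-name format, not the rule I am trying to find):

# Exclude an optimizer rule by its fully qualified name.
# ConstantFolding is only an example of the format, not the rule I want to disable.
spark.conf.set(
    "spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.ConstantFolding",
)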
Thanks in advance.