
I have a Spark job that reads from a SQL Server table using the JDBC connector.

The table is indexed on columns named insertion_time and car_id.

At first the query I tried to use to read the table with was:

SELECT car_id, km_traveled, fuel_used, manufacturing_date, insertion_time
FROM car_info
WHERE insertion_time > '2022-01-01 00:00:01' AND km_traveled > 1000

This query timed out, as the table is not indexed on the km_traveled column.

So I changed the query I use to read from the table to:

SELECT car_id, km_traveled, fuel_used, manufacturing_date, insertion_time
FROM car_info
WHERE insertion_time > '2022-01-01 00:00:01'

Then, immediately after the data is read, I apply the filter km_traveled > 1000 in the Spark job itself (as opposed to in the read query).

Spark sees this filter and pushes it down, adding WHERE km_traveled > 1000 to the read query sent to the DB.
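To make the behaviour concrete, here is a minimal sketch of what that pushdown does to the generated SQL. The helper below is mine, not Spark's internals; it only mimics the effect of ANDing the pushed predicate onto the existing WHERE clause:

```python
def push_down_filter(base_query: str, predicate: str) -> str:
    """Mimic Spark's filter pushdown by ANDing a predicate onto the WHERE clause."""
    return f"{base_query} AND {predicate}"

base = (
    "SELECT car_id, km_traveled, fuel_used, manufacturing_date, insertion_time "
    "FROM car_info WHERE insertion_time > '2022-01-01 00:00:01'"
)
pushed = push_down_filter(base, "km_traveled > 1000")
print(pushed)
```

The result is effectively the original query that timed out, which is why moving the filter into Spark changed nothing.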

However, since the table is not indexed on the km_traveled column and has 1,500,000,000 rows, I believe this optimization is actually slowing down my query to the point where it still hits a timeout that aborts the query.

My question is, is there a way to disable such an optimization?

I saw that optimization rules in general can be disabled here, but I could not find the specific rule responsible for this pushdown.

Thanks in advance.

  • What columns does your index (presumably on `insertion_time`) `INCLUDE`? I suspect the real problem is that your index isn't covering. Moving the column `km_traveled` from the `WHERE` to the `SELECT` is only likely to make the query slower if the column isn't in the `INCLUDE`, as the data engine now needs to scan *and* return that column, or look it up as part of the key lookup. – Thom A Jul 24 '23 at 16:04
  • A "non optimized query" - your term - would have to retrieve all 1,500,000,000 rows and then search in them (in the client?) for the km_traveled > 1000 records ... so that sounds bad too. As noted above, better to provide a covering index, so that the filtering can happen on the server without having to scan all rows in the table. – topsail Jul 24 '23 at 16:08
  • Timeouts could also occur because it generates statistics for the km_traveled column before creating an appropriate query plan. You can usually see this in Activity Monitor as a query that appears to do nothing and cannot be shown in Query Plan mode. – siggemannen Jul 24 '23 at 16:16
  • I'm not really buying that the addition of `AND km_traveled > 1000` is the problem here. You say there is no index on that column, so I would expect the plan to be much the same as whatever the plan for `SELECT car_id, km_traveled, fuel_used, manufacturing_date, insertion_time FROM car_info WHERE insertion_time > '2022-01-01 00:00:01'` is, but with a residual predicate. Your question is missing information about whether that query also times out. – Martin Smith Jul 24 '23 at 16:29
  • Though the `km_traveled > 1000` predicate could cause auto stats update as suggested by siggemannen – Martin Smith Jul 24 '23 at 16:52

1 Answer


I'm not sure how to do this in Spark, but the usual workaround is to add a row-goal TOP to the query before adding the extra filter:

SELECT car_id, km_traveled, fuel_used, manufacturing_date, insertion_time
FROM (
    SELECT TOP (1000000000) *
    FROM car_info
    WHERE insertion_time > '2022-01-01 00:00:01'
) t
WHERE km_traveled > 1000
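On the Spark side, a derived-table subquery like the one above can typically be passed as the JDBC `dbtable` option (the `(subquery) alias` form). The helper below is a hypothetical sketch that just builds that option string; the connection details in the comment are placeholders:

```python
def as_dbtable_option(subquery: str, alias: str = "t") -> str:
    """Wrap a subquery so it can be passed as Spark's JDBC `dbtable` option."""
    return f"({subquery}) {alias}"

inner = (
    "SELECT TOP (1000000000) * FROM car_info "
    "WHERE insertion_time > '2022-01-01 00:00:01'"
)
dbtable = as_dbtable_option(inner)
print(dbtable)

# Usage sketch (jdbc_url and credentials are placeholders):
# spark.read.format("jdbc").option("url", jdbc_url) \
#      .option("dbtable", dbtable).load() \
#      .filter("km_traveled > 1000")
```

The idea is that the TOP row goal inside the derived table discourages the server from choosing a plan driven by the non-indexed `km_traveled` predicate, even if Spark pushes that filter into the outer query.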

Incidentally, the optimum index for this query is probably one with INCLUDE columns:

INDEX IX NONCLUSTERED (insertion_time)
  INCLUDE (car_id, km_traveled, fuel_used, manufacturing_date)

Bear in mind that the server might actually be right that km_traveled is a better column to filter on, in which case you would use:

INDEX IX NONCLUSTERED (km_traveled)
  INCLUDE (car_id, insertion_time, fuel_used, manufacturing_date)

Note that you cannot add both columns as key columns of a single index and get an efficient result, as both WHERE predicates are inequalities (they do not use =), so only the leading key column can be seeked on.

Charlieface