6

I'm using MS SQL.

I have a huge table with indices to make this query fast:

select userid from IncrementalStatistics where
IncrementalStatisticsTypeID = 5 and
IncrementalStatistics.AssociatedPlaceID = 47828 and
IncrementalStatistics.Created > '12/2/2010

It returns in less than 1 second. The table has billions of rows. There are only around 10000 results.

I would expect this query to also complete in about a second:

select userid from IncrementalStatistics where
IncrementalStatisticsTypeID = 5 and
IncrementalStatistics.AssociatedPlaceID = 47828 and
IncrementalStatistics.Created > '12/2/2010'

intersect

select userid from IncrementalStatistics where
IncrementalStatisticsTypeID = 5 and
IncrementalStatistics.AssociatedPlaceID = 40652 and
IncrementalStatistics.Created > '12/2/2010'

intersect

select userid from IncrementalStatistics where
IncrementalStatisticsTypeID = 5 and
IncrementalStatistics.AssociatedPlaceID = 14403 and
IncrementalStatistics.Created > '12/2/2010'

But it takes 20 seconds. All the individual queries take < 1 second and return around 10k results.

I would expect SQL internally to throw the results from each of these subqueries into a hashtable and do a hash-intersection - should be O(n). The result sets are big enough to fit in memory, so I doubt it's an IO issue.

I wrote an alternate query that is just a series of nested JOINs and this also takes around 20 seconds, which makes sense.

Why is INTERSECT so slow? Does it reduce to a JOIN at an early stage of the query processing?

John Shedletsky
  • 7,110
  • 12
  • 38
  • 63
  • 1
    "I doubt it's an io issue" -> what's the explain plan say that most expensive part of the query is? – Donnie Dec 06 '10 at 22:53
  • Does MS SQL have an EXPLAIN or some way to view the query plan? Based on other people's answers, sounds like the the INTERSECT implementation just isn't smart... – Brendan OConnor Dec 08 '10 at 14:27
  • @Brendan - yeah there's a nice visualization for the query plan. This query didn't seem subtle enough to need to resort to that - I was looking for the intuitive argument. – John Shedletsky Dec 08 '10 at 17:46

1 Answers1

14

Give this a try instead. Untested obviously, but I think it will get you the results you want.

select userid 
    from IncrementalStatistics 
    where IncrementalStatisticsTypeID = 5 
        and IncrementalStatistics.AssociatedPlaceID in (47828,40652,14403)  
        and IncrementalStatistics.Created > '12/2/2010'
    group by userid
    having count(distinct IncrementalStatistics.AssociatedPlaceID) = 3
Joe Stefanelli
  • 132,803
  • 19
  • 237
  • 235
  • Dude! That was a ton faster. I'd like to understand why? It seems like under the hood it's actually doing *more* work than the above. – John Shedletsky Dec 06 '10 at 22:52
  • 1
    @John Shedletsky: This is faster because it's a single pass on the IncrementalStatistics table vs. 3 completely separate queries. – Joe Stefanelli Dec 06 '10 at 22:54
  • 1
    @Joe: I doubt that's the reason it's so much faster. INTERSECTing 2 sets of 10000 in-memory strings takes well under 1s on any PC, so the only reason why John's query could be taking longer than 3*1+1+1=5s is because the DB engine is choosing a poor plan for his original query. – j_random_hacker Dec 07 '10 at 08:01
  • @j_random_hacker: Obviously it's hard to say without access to the OP's system. When I benchmarked this myself, logical reads and CPU times were very close for both versions, however the execution plan for the intersect version does include 3 separate index seek operations vs. one index seek for the group by/having version. – Joe Stefanelli Dec 07 '10 at 14:21
  • Good interview question. I'm going to use it on the next web engineer we hire. – John Shedletsky Dec 07 '10 at 17:36
  • 1
    @John: If you're looking for an interview question with a "right answer" (as opposed to just seeing a candidate work through possibilities), I think this is not a good choice, since it depends on knowing the unknowable internal behaviour of the MS SQL query planner. Sorry to be so negative, it's great that Joe's answer worked for you, but that is really just the luck of the draw (as he says, the 2 versions were very similar on his system). – j_random_hacker Dec 08 '10 at 03:58
  • 4
    Nice usage of `group by ... having count()` – Scotty.NET Nov 28 '13 at 11:35