I'm reading Hadoop: The Definitive Guide by Tom White. In section 13.6, "HBase vs. RDBMS", he says that with a lot of data even simple queries, like getting the 10 most recent items, become extremely expensive, and that they had to rewrite them in PL/Python.
He gives the following query as an example:
SELECT id, stamp, type FROM streams
WHERE type IN ('type1','type2','type3','type4',...,'typeN')
ORDER BY stamp DESC LIMIT 10 OFFSET 0;
He then says: "an RDBMS query planner treats this query as follows:
MERGE (
SELECT id, stamp, type FROM streams
WHERE type = 'type1' ORDER BY stamp DESC,
...,
SELECT id, stamp, type FROM streams
WHERE type = 'typeK' ORDER BY stamp DESC
) ORDER BY stamp DESC LIMIT 10 OFFSET 0;
The problem here is that we are after only the top 10 IDs, but the query planner actually materializes an entire merge and then limits at the end. ... We actually went so far as to write a custom PL/Python script that performed a heapsort. ... In nearly all cases, this outperformed the native SQL implementation and the query planner’s strategy ..."
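The book doesn't show the script, but the idea behind such a heapsort is a lazy K-way merge: each per-type stream is already sorted by stamp, and you stop after popping 10 rows instead of materializing the whole merge. A minimal sketch in plain Python (the stream shape and function name are my assumptions, not the book's code):

import heapq
from itertools import islice

def top_n(streams, n=10):
    # streams: iterators of (stamp, id, type) tuples, each already
    # sorted by stamp descending -- hypothetical row shape.
    # heapq.merge keeps a K-element heap and yields lazily, so we
    # pop only n rows instead of materializing the full merge.
    merged = heapq.merge(*streams, reverse=True)
    return list(islice(merged, n))

Each pop costs O(log K), so fetching the top 10 touches roughly K + 10 rows in total, no matter how large the table is.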
Expected performance and experimental results
I couldn't imagine a data set that would cause such problems that you'd have to write PL/Python to get such a simple query right. So I played with this problem for a while and came up with the following observations:
The performance of such a query should be bounded by O(K log N), because it can be translated into something like the following:
SELECT * FROM (
  (SELECT id, stamp, type FROM streams
   WHERE type = 'type1' ORDER BY stamp DESC LIMIT 10)
  UNION ALL
  ...
  UNION ALL
  (SELECT id, stamp, type FROM streams
   WHERE type = 'typeK' ORDER BY stamp DESC LIMIT 10)
) t ORDER BY stamp DESC LIMIT 10;
(note the 'LIMIT 10' in each branch; the parentheses around each subquery are what make ORDER BY and LIMIT legal inside a union in PostgreSQL, and UNION ALL avoids the deduplication pass that plain UNION would add)
Each subquery should run as fast as finding the right position in an index, O(log N), and returning 10 items, assuming an index on (type, stamp). Repeating that K times gives O(K log N).
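To make that bound concrete, here is a toy simulation of what the union query does against a composite index, sketched in Python under my own assumptions about the data layout (a single sorted list with numeric stamps standing in for a B-tree index on (type, stamp)):

import heapq
from bisect import bisect_left

def top_n(index_rows, types, n=10):
    # index_rows: list of (type, stamp, id) tuples sorted ascending,
    # standing in for a B-tree index on (type, stamp); stamps are
    # assumed numeric for this sketch.
    candidates = []
    for t in types:
        lo = bisect_left(index_rows, (t,))               # O(log N) seek
        hi = bisect_left(index_rows, (t, float("inf")))  # O(log N) seek
        # The branch's LIMIT 10: the n newest rows of this type.
        candidates.extend(index_rows[max(lo, hi - n):hi])
    # The final sort sees at most K*n candidate rows, never the table.
    return heapq.nlargest(n, candidates, key=lambda r: r[1])

The K binary searches are the O(K log N) part; the final heapq.nlargest works on at most K·10 rows, which is independent of the table size.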
And even if the query planner is so bad that it cannot optimize the first query, we can always translate it into the query with unions and get the desired performance without writing anything in PL/Python.
To double-check my calculations, I ran the queries above on PostgreSQL filled with 9,000,000 test records. The results confirmed my expectations: both queries were quite fast, 100 ms for the first query and 300 ms for the second (the one with unions).
So if the query runs in 100 ms for 9,000,000 records (log2 N ≈ 23), then for 9,000,000,000 records (log2 N ≈ 33) it should run in about 140 ms.
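That extrapolation is just the measured time scaled by the ratio of the logarithms (the base cancels out), a back-of-envelope estimate rather than a measurement:

import math

measured_ms = 100  # observed at 9,000,000 rows
predicted = measured_ms * math.log(9e9) / math.log(9e6)
print(round(predicted))  # ~143 ms at 9,000,000,000 rows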
Questions
- Do you see any flaws in the above reasoning?
- Can you imagine a data set where you would need to rewrite such a query in PL/Python, as the book describes?
- Do you see any situation in which such a query wouldn't run in O(K log N)?