I'm reading Hadoop: The Definitive Guide by Tom White. In section 13.6, "HBase vs. RDBMS", he says that with a lot of data even simple queries, like getting the 10 most recent items, become extremely expensive, and that they had to rewrite them in PL/Python.
He gives the following query as an example:
SELECT id, stamp, type FROM streams
WHERE type IN ('type1','type2','type3','type4',...,'typeN')
ORDER BY stamp DESC LIMIT 10 OFFSET 0;
He then says: "an RDBMS query planner treats this query as follows:
MERGE (
SELECT id, stamp, type FROM streams
WHERE type = 'type1' ORDER BY stamp DESC,
...,
SELECT id, stamp, type FROM streams
WHERE type = 'typeK' ORDER BY stamp DESC
) ORDER BY stamp DESC LIMIT 10 OFFSET 0;
The problem here is that we are after only the top 10 IDs, but the query planner actually materializes an entire merge and then limits at the end. ... We actually went so far as to write a custom PL/Python script that performed a heapsort. ... In nearly all cases, this outperformed the native SQL implementation and the query planner’s strategy ..."
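The book doesn't show the script, but the idea behind such a heapsort is a lazy K-way merge: each per-type stream is already sorted by stamp, and you stop after popping 10 rows instead of materializing the whole merge. A minimal sketch in plain Python (the stream shape and function name are my assumptions, not the book's code):

import heapq
from itertools import islice

def top_n(streams, n=10):
    # streams: iterators of (stamp, id, type) tuples, each already
    # sorted by stamp descending -- hypothetical row shape.
    # heapq.merge keeps a K-element heap and yields lazily, so we
    # pop only n rows instead of materializing the full merge.
    merged = heapq.merge(*streams, reverse=True)
    return list(islice(merged, n))

Each pop costs O(log K), so fetching the top 10 touches roughly K + 10 rows in total, no matter how large the table is.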
Expected performance and experimental results
I couldn't imagine a data set that would cause such problems that you'd have to write PL/Python to get such a simple query right. So I played with this problem for a while and came up with the following observations:
The performance of such a query should be bounded by O(K log N), because it can be translated into something like the following:
SELECT * FROM (
  (SELECT id, stamp, type FROM streams
   WHERE type = 'type1' ORDER BY stamp DESC LIMIT 10)
  UNION ALL
  ...
  UNION ALL
  (SELECT id, stamp, type FROM streams
   WHERE type = 'typeK' ORDER BY stamp DESC LIMIT 10)
) t ORDER BY stamp DESC LIMIT 10;
(note the 'LIMIT 10' in each branch; the parentheses around each subquery are what make ORDER BY and LIMIT legal inside a union in PostgreSQL, and UNION ALL avoids the deduplication pass that plain UNION would add)
Each subquery should run as fast as finding the right position in an index, O(log N), and returning 10 items, assuming an index on (type, stamp). Repeating that K times gives O(K log N).
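To make that bound concrete, here is a toy simulation of what the union query does against a composite index, sketched in Python under my own assumptions about the data layout (a single sorted list with numeric stamps standing in for a B-tree index on (type, stamp)):

import heapq
from bisect import bisect_left

def top_n(index_rows, types, n=10):
    # index_rows: list of (type, stamp, id) tuples sorted ascending,
    # standing in for a B-tree index on (type, stamp); stamps are
    # assumed numeric for this sketch.
    candidates = []
    for t in types:
        lo = bisect_left(index_rows, (t,))               # O(log N) seek
        hi = bisect_left(index_rows, (t, float("inf")))  # O(log N) seek
        # The branch's LIMIT 10: the n newest rows of this type.
        candidates.extend(index_rows[max(lo, hi - n):hi])
    # The final sort sees at most K*n candidate rows, never the table.
    return heapq.nlargest(n, candidates, key=lambda r: r[1])

The K binary searches are the O(K log N) part; the final heapq.nlargest works on at most K·10 rows, which is independent of the table size.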
And even if the query planner is so bad that it cannot optimize the first query, we can always translate it into the query with unions and get the desired performance without writing anything in PL/Python.
To double-check my calculations, I ran the queries above on PostgreSQL filled with 9,000,000 test records. The results confirmed my expectations: both queries were quite fast, 100 ms for the first query and 300 ms for the second (the one with unions).
So if the query runs in 100 ms for 9,000,000 records (log2 N ≈ 23), then for 9,000,000,000 records (log2 N ≈ 33) it should run in about 140 ms.
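That extrapolation is just the measured time scaled by the ratio of the logarithms (the base cancels out), a back-of-envelope estimate rather than a measurement:

import math

measured_ms = 100  # observed at 9,000,000 rows
predicted = measured_ms * math.log(9e9) / math.log(9e6)
print(round(predicted))  # ~143 ms at 9,000,000,000 rows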
Questions
- Do you see any flaws in the above reasoning?
- Can you imagine a data set where you would need to rewrite such a query in PL/Python, as the book describes?
- Do you see any situation in which such a query wouldn't run in O(K log N)?