
Employee (ename, title, dname, address); all are string fields of the same length.

The ename attribute is a candidate key. The relation contains 10,000 pages. There are 10 buffer pages.

The query is: SELECT E.title, E.ename FROM Employee E WHERE E.title = 'Administrator'

Assume that only 10% of Employee tuples meet the selection condition.

Suppose that a clustered B+ tree index on ename is (the only index) available. What is the cost of the best plan?

How would I calculate this cost? and if there is a clustered B+ tree index on title, how would i calculate that?

Another query: SELECT E.ename FROM Employee E WHERE E.title = 'Administrator' AND E.dname = 'Finance'

Assume that only 10% of Employee tuples meet the condition E.title = 'Administrator', only 10% meet E.dname = 'Finance', and only 5% meet both conditions.

Suppose that a clustered B+ tree index on is (the only index) available. What is the cost of the best plan?

Experts! Please help. Any comments/suggestions would be greatly appreciated. I want to understand the whole process. I have done a lot of research, and I think I know how to calculate the cost of each operation. What confuses me is that they say the relation contains 10,000 pages without saying how many tuples are in each page. From what I learned, I believe we must know the total size of the relation in terms of tuples. Am I correct? Why?

THANKS anyone who took the time to read the question :-)

Shayaan
  • I have no idea what the cost would be - but a query using *any* index for the first query is very likely to be slower than a full table scan in all cases. Probably true for the second query too. – symcbean Mar 18 '16 at 22:07
  • I think you need to use the [explain](http://dev.mysql.com/doc/refman/5.7/en/explain.html) statement to have a better understanding. – Alexander Popov Mar 18 '16 at 22:51

1 Answer


Without a suitable index, the query will do a table scan. Since reading the rows is, in many cases, the dominant part of the execution time, the variations you mention do not matter.

If you do have a relevant index, and that index is selective enough (10% is likely to be "selective enough"), there will be two steps to the query:

  1. Scan part of the index, which is a separate BTree.
  2. For each row, get the PRIMARY KEY (assuming you are using InnoDB) from that BTree. With that PK, look up the row in the main BTree, which contains the PK and the Data.
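The two steps above can be sketched with a simple textbook I/O-cost model. This is only an illustration; the function name and the example page/row counts are my assumptions, not values given in the question:

```python
# Hypothetical textbook I/O cost model for a secondary-index lookup.
# All names and numbers here are illustrative assumptions.

def secondary_index_cost(leaf_pages, selectivity, matching_rows):
    """Cost = scanning the matching fraction of the index's leaf pages,
    plus one main-BTree probe per matching row (pessimistic: every
    probe is assumed to be a separate page read)."""
    index_scan = leaf_pages * selectivity
    row_probes = matching_rows          # one PK lookup per match
    return index_scan + row_probes

# Example: 100 index leaf pages, 10% selectivity, 1,000 matching rows.
print(secondary_index_cost(100, 0.10, 1_000))  # 1010.0
```

Note how the per-row probes, not the index scan itself, dominate the cost; this is why a non-selective secondary index can lose to a full table scan.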

If all the necessary blocks are cached in the buffer_pool (again, assuming InnoDB), the variations in cost are rather minor.

If not all blocks are in cache (because mysqld just started up, or because the index/data is too big to remain cached), then you are into "counting the disk hits", because the dominant part of the "cost" is I/O. Now computing the cost is quite complex: you need to know what percentage is already cached, whether the query will 'thrash' the cache, and whether the 10% are evenly scattered, clumped together, or something in between.

Since (for InnoDB), the PK is "clustered" with the data, a lookup by PK acts differently than a lookup by a secondary key.

10K pages is "small". 10 buffer pages -- what do you mean? "All are string fields of the same length" -- that is unrealistic, and using CHAR instead of VARCHAR is not a good idea. Anyway, the string lengths have very little bearing on this discussion.

WHERE E.title = 'Administrator' AND E.dname = 'Finance' -- begs for INDEX(title, dname) in either order.
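Such an index might be declared like this (the index names are my own; either column order serves this particular WHERE clause):

```sql
-- Either column order satisfies WHERE title = ... AND dname = ...
CREATE INDEX idx_title_dname ON Employee (title, dname);
-- or: CREATE INDEX idx_dname_title ON Employee (dname, title);
```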

A "Rule of Thumb": One block (InnoDB) can hold 100 rows (of data or index). (This, of course, can vary wildly. But it is sometimes handy for "counting the disk hits".)
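Applying that rule of thumb to the question's numbers gives a rough tuple count and the textbook I/O costs for a full scan versus a clustered index on title. The 100-rows-per-page figure is an assumption, not something the question states:

```python
# Applying the "100 rows per block" rule of thumb to the question's
# numbers. ROWS_PER_PAGE is an assumption; the question gives only pages.

ROWS_PER_PAGE = 100          # rule-of-thumb rows per block
pages = 10_000               # relation size from the question
selectivity = 0.10           # 10% of tuples match title = 'Administrator'

total_rows = pages * ROWS_PER_PAGE
matching_rows = int(total_rows * selectivity)

full_scan_io = pages                      # read every page once
clustered_index_io = pages * selectivity  # matching rows sit on contiguous pages

print(total_rows)          # 1000000
print(matching_rows)       # 100000
print(full_scan_io)        # 10000
print(clustered_index_io)  # 1000.0
```

This also answers why the question can be solved without a tuple count: for a full scan or a clustered-index range scan, the page count alone determines the I/O cost.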

In my cookbook, I find it easier to focus on designing the 'best' index, without computing 'cost'.

Further notes on queries

"Assume that only 10% of Employee tuples meet the condition E.title = 'Administrator', only 10% meet E.dname = 'Finance' and only 5% meet both conditions." In the case of MySQL, here are more details:

Case 1: INDEX(title) -- similar to first query -- index range scan of 10%, then probes into data.
Case 2: INDEX(dname) -- ditto.
Case 3: Both indexes -- there is a slim chance of using "index merge intersect" to do two index "range scans", merge the two sets together, then reach into the data for the rows.
Case 4 (the best): INDEX(title, dname) (or opposite order): Back to an index range scan, but limited to only 5% of the items.
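The cases above can be compared with rough, illustrative I/O estimates. These use the question's 10,000-page relation, assume about 100 rows per page and about 1,000 index leaf pages, and charge one page read per matching row for a secondary-index probe; every one of those numbers is an assumption, not a definitive cost:

```python
# Rough, illustrative I/O estimates for the indexing cases.
# PAGES comes from the question; ROWS and leaf_pages are assumptions.

PAGES = 10_000
ROWS = PAGES * 100           # assumed ~100 rows per page

def clustered_range_scan(selectivity):
    # Matching rows are clumped on contiguous data pages.
    return PAGES * selectivity

def secondary_index_probe(selectivity, leaf_pages=1_000):
    # Scan the matching leaf pages, then one page read per matching row.
    return leaf_pages * selectivity + ROWS * selectivity

print(secondary_index_probe(0.10))   # Cases 1/2: single-column index, 10% match
print(secondary_index_probe(0.05))   # Case 4 as a secondary index, 5% match
print(clustered_range_scan(0.05))    # Case 4 if the index were clustered
```

The gap between the last two lines illustrates why "clustered" matters so much: the composite index narrows the match to 5% either way, but per-row probes into the main BTree can still dwarf a contiguous range scan.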

MySQL's preferred Engine is InnoDB. What I have discussed assumes that, not MyISAM. With InnoDB, "Data" is stored in a B+Tree, so is every secondary index. Keep this similarity in mind when thinking through how a query is performed. Note also that the 'leaf nodes' of a secondary index contain the PK, thereby providing the mechanism for finding the rest of the record.

Rick James
  • thanks a lot Rick for replying... I'm not designing anything; this is just the criteria for a database-course question to compute the cost... and that's all the information they provided... I wasn't sure how to go about this, so I asked... they require the total number of I/Os for each query with a B+ tree index – Shayaan Mar 20 '16 at 12:36
  • The "cost model" is poorly defined... What is in a "page"? Possible answer: "there are 100 rows of data or index in a 'page'." Do you need to count the non-leaf nodes in a BTree? If so, you need to assume the "fan-out". (That is where my RoT of "100" comes into play. Plus how many records in the table.) Etc. – Rick James Mar 20 '16 at 16:38