First, let me recommend optimal indexes for the queries:
a) Select Ssn From Employee Where Salary>2500 and Salary<3000 and DepartmentID=1;
INDEX(DepartmentID, Salary) -- in this order
(The way that is written, both 2500 and 3000 are excluded. Did you really want that?)
b) Create Clustered Index ixEmpSal on Employee (Salary); Select Ssn From Employee Where Salary>2500 and Salary<3000 and Gender=’M’;
In MySQL, the only clustered index is the PRIMARY KEY
. And, the PK must be unique. (This invalidates your Question 1) So, that index is probably not viable. Instead, a surrogate is often used.
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
INDEX(gender, salary)
(I'll present another option later.)
MySQL uses B+Trees for both the data (ordered by the PK) and each secondary (non-clustered) INDEX
. Secondary index rows have a copy of the PK to look up the rest of the row.
Without a good index, MySQL will scan the entire table ("table scan"). Because it is a B+Tree, not just a BTree, it will read only the leaf nodes (after locating the 'first' row by drilling down the tree once). So the number of pages read is approximately 8,000,000/100 + 4-1.
log-base-100 (8M) = 4, so the B+Tree will be about 4 levels deep for the data. For each secondary index, log-base-400 (8M) ~= 3 levels.
- WIth only
INDEX(Salary)
: 250,000 / 400 ~= 625 index blocks plus 4 * 250,000 data blocks to check the dept or gender, plus fetch any other columns needed (SSN, as mentioned). The data blocks (in this case) will be heavily cached, especially the non-leaf nodes.
Select b Assuming an egalitarian company, with 50% Gender=M. Now we are looking at 250,000 * 50% / 400 ~= 313 index blocks. And half as many data blocks, if needed.
InnoDB uses 16KB blocks. So multiply by that to get the number of bytes on disk and/or in the buffer_pool (assuming it can all be cached). Tack on about 1% for the non-leaf nodes.
But... Once it has located the "rows" in the index, it needs to reach over into the data B+Tree to get any other columns needed. This will involve drilling down the data's B+Tree (the +
is irrelevant here) to get each row. Now we are looking at 4 pages per data row fetched -- but with most being cached.
A different technique
PRIMARY KEY(Salary, id), -- cluster on Salary
INDEX(id) -- to keep AUTO_INCREMENT happy
-- and neither of my indexes
Now both of your queries only need to look in data B+Tree. Select b is 250,000 / 100 ~= 2500 blocks and uses the +
of B_Tree and does not use any secondary index. It will scan all 250K rows to see which ones match the Dept or Gender, but then the SSN is immediately available.
Note that tacking id
onto Salary
provides the Uniqueness required by the PK. So this comes closer to what you asked about.
Nodes are rarely full InnoDB's blocks are usually between 1/2 and 15/16 full, with an average of about 70%. This is because of block splits, etc, from Updates/Deletes/Inserts. This may mess with your "100" and "400", but it does not mess with the conclusions that I am presenting.
(Your Question 2) MySQL does not use two INDEXes
in a single Select. (There are rare exceptions.) My suggested composite (multi-column) indexes are even better than trying to use two indexes.
(Security is even more important if SSN is sitting in your database.)