What is the difference clustered vs non-clustered index when we calculate number of accesses pages?

Question

Employee (Ssn, FirstName, LastName, Gender, Age, Salary, DepartmentID) Assumptions: • There is no index. • There are total 8,000,000 rows. • 250,000 rows fall within the 2,500 to 3,000 salary range. • There are 100 rows for each page. • 400 index rows fit on a non-clustered index leaf page. What is the number of page accesses of the following queries for the following cases with the given assumptions? Consider each case separately. a) Select Ssn From Employee Where Salary>2500 and Salary<3000 and DepartmentID=1; b) Create Clustered Index ixEmpSal on Employee (Salary); Select Ssn From Employee Where Salary>2500 and Salary<3000 and Gender=’M’;

I have a employee table and I want to know number of page accesses of any query.I dont understand the some part of the following question

1-)how I use this part '400 index rows fit on a non-clustered index leaf page.' and What happens if I turn 'non-clustered index' part of the question into 'clustered index'

2-)in a part of the question the Does the database search separately for each and part?I mean that for example satabase search Salary>2500 then start begining to search Salary<3000 then search DepartmenId

ps:I don't understand theoretical explaining very well can you explain with example

Rick James · Answer 1 · 2023-01-24T02:01:38.110

First, let me recommend optimal indexes for the queries:

a) Select Ssn From Employee Where Salary>2500 and Salary<3000 and DepartmentID=1;

INDEX(DepartmentID, Salary) -- in this order

(The way that is written, both 2500 and 3000 are excluded. Did you really want that?)

b) Create Clustered Index ixEmpSal on Employee (Salary); Select Ssn From Employee Where Salary>2500 and Salary<3000 and Gender=’M’;

In MySQL, the only clustered index is the PRIMARY KEY. And, the PK must be unique. (This invalidates your Question 1) So, that index is probably not viable. Instead, a surrogate is often used.

id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
INDEX(gender, salary)

(I'll present another option later.)

MySQL uses B+Trees for both the data (ordered by the PK) and each secondary (non-clustered) INDEX. Secondary index rows have a copy of the PK to look up the rest of the row.

Without a good index, MySQL will scan the entire table ("table scan"). Because it is a B+Tree, not just a BTree, it will read only the leaf nodes (after locating the 'first' row by drilling down the tree once). So the number of pages read is approximately 8,000,000/100 + 4-1.

log-base-100 (8M) = 4, so the B+Tree will be about 4 levels deep for the data. For each secondary index, log-base-400 (8M) ~= 3 levels.

WIth only INDEX(Salary): 250,000 / 400 ~= 625 index blocks plus 4 * 250,000 data blocks to check the dept or gender, plus fetch any other columns needed (SSN, as mentioned). The data blocks (in this case) will be heavily cached, especially the non-leaf nodes.

Select b Assuming an egalitarian company, with 50% Gender=M. Now we are looking at 250,000 * 50% / 400 ~= 313 index blocks. And half as many data blocks, if needed.

InnoDB uses 16KB blocks. So multiply by that to get the number of bytes on disk and/or in the buffer_pool (assuming it can all be cached). Tack on about 1% for the non-leaf nodes.

But... Once it has located the "rows" in the index, it needs to reach over into the data B+Tree to get any other columns needed. This will involve drilling down the data's B+Tree (the + is irrelevant here) to get each row. Now we are looking at 4 pages per data row fetched -- but with most being cached.

A different technique

PRIMARY KEY(Salary, id),  -- cluster on Salary
INDEX(id)                 -- to keep AUTO_INCREMENT happy
-- and neither of my indexes

Now both of your queries only need to look in data B+Tree. Select b is 250,000 / 100 ~= 2500 blocks and uses the + of B_Tree and does not use any secondary index. It will scan all 250K rows to see which ones match the Dept or Gender, but then the SSN is immediately available.

Note that tacking id onto Salary provides the Uniqueness required by the PK. So this comes closer to what you asked about.

Nodes are rarely full InnoDB's blocks are usually between 1/2 and 15/16 full, with an average of about 70%. This is because of block splits, etc, from Updates/Deletes/Inserts. This may mess with your "100" and "400", but it does not mess with the conclusions that I am presenting.

(Your Question 2) MySQL does not use two INDEXes in a single Select. (There are rare exceptions.) My suggested composite (multi-column) indexes are even better than trying to use two indexes.

(Security is even more important if SSN is sitting in your database.)

What is the difference clustered vs non-clustered index when we calculate number of accesses pages?

1 Answers1