Create sql indexes for complex filtering

Question

There is table in sql database human. I have ui for this table and filter form like this:

I can set only some values (for instance age and state only). If filter item is not specified it won't be add to sql WHERE condition. WHERE condition is combined in order described on picture. So if I want to create indexes for all cases to get performance boost I need to create this indexes:

first name
last name
age
state
birthday
gender
first name + last name
first name + last name + age
first name + last name + age + state
...
state + birtday
state + birthday + gender
...
state + gender

It looks bad for me. Should I select only most used combinations? What do you think?

Which RDBMS is this for? Please add a tag to specify whether you're using `mysql`, `postgresql`, `sql-server`, `oracle` or `db2` - or something else entirely. — marc_s, Jul 24 '16 at 14:02
(1) Use tags and not comments to specify vendor specifics; that is what tags are for, and comments are (assumed to be) temporary. (2) Which is it: MySQL or SQL Server? Those are two **very different** engines, with dramatically different capabilities in some instances. (3) Never (not *almost never*; but actually ***never***) put *age* in an index - as it changes for 0.3% of rows every day, requiring constant index rebuilds. Store BirthDate and calculate an *age* query in terms of BIrthDate and CurrentDate. — Pieter Geerkens, Jul 24 '16 at 14:51

Ben Rubin · Accepted Answer · 2022-11-22T03:55:41.137

If you have the index first name + last name + age + state, you don't also need first name + last name + age and first name + last name and first name. If you have the index first name + last name + age + state and a user searches only on "first name" and "last name", the database will be able to use that index. As long as the user specifies columns in the same left-to-right order as your index, the database will be able to use the index even if every column isn't specified.

For instance, if you have the index first name + last name + age + state and the user specifies "first name" and "last name", then the database will be able to use that index to jump to the matching rows. However, if the user specifies "first name" and "age", or "first name" and "state", then the database will only partially use the index to jump to the rows with matching first names, but it will then have to do a scan for the rows that match "age" or "state". If you want to know the technical details behind why this is true, read about database indexes and B+ trees. This is a good explanation.

Databases can also use multiple indexes when running a single query. If you have the indexes

`last name`
`state`
`age`

And the user searches for "last name", "state", and "age", the database will be able to use all three indexes to quickly find the matching rows for each field, and then the results will be combined and rows that don't match all three indexes will not be selected. If you look at an execution plan, you'll be able to see it doing this. Granted this will be a tiny bit slower than having a single index that has every necessary field in it, but it will prevent you from having a ton of indexes.

Also note that even if an index exists, the database may not necessarily use that index because doing a row scan maybe faster. For instance, take the above example with three different indexes, and suppose the user does a search on "last name", "first name", and "state". Because the combination of "last name" and "first name" has a such a high selectivity (meaning most of the values in that index are unique), it might be faster to just use the index to get all the rows that match the first name and last name and then just do a simple iterative scan on those rows to find the ones that also have the matching state, than to use the state index as well, and then join the rows that were returned by both indexes.

When you're designing your indexes, an index won't give you much of a performance boost (and may actually be worse than doing a full table scan) if the selectivity of your index is really low. Gender, for instance, is not a good field to have indexed because you only have two possible values. If the user is searching only on gender, you will never get good performance with or without indexes because you will return half your rows.

Row-for-row, a full table scan is actually faster than using an index. The reason for this is that when the database does a table scan, it is able to jump straight to the data page on disk. When it uses an index, it has to go through a few intermediate index pages before it actually gets to where the data is stored on disk. For a field like "gender" where you're going to be selecting half of your rows, the added overhead of following your index links for half the rows in the table may outweigh the cost of just scanning the entire table without using indexes.

In your example, I would recommend the following indexes:

`last name`
`first name`
`birthdate`
`state`

If you have a specific combination of fields that is searched on frequently, then you can make an index for that too to speed things up. However, don't make an index for every combination of fields.

If you use "birthdate" instead of "birthday", then you don't need "age" because you can calculate that based on "birthdate" and then do a between query on "birthdate". If you're forced to have separate columns for "birthday" and "age", then you could index "age" as well. However, like another user commented below, you'd have to constantly update your ages. I strongly recommend against that design.

One final thing to consider is whether to try to make a covering index. A covering index is one in which every field that the user searched for is part of your index. For example, suppose your table has 100 fields in it, but users are usually only interested in looking up someone's state and age based on their name. So a large percentage of your queries look something like this

SELECT STATE, AGE FROM PEOPLE WHERE FIRSTNAME = 'Homer' AND LASTNAME = 'Simpson'

If your index is LASTNAME, FIRSTNAME, then the database will look up "Homer" and "Simpson" in your index (which will involve reading a few index pages from disk), use the index pointer to go to the disk page where the data record is stored, read that entire data page, parse it into fields, and then return the state and age.

Now, suppose you run the same query but your index is LASTNAME, FIRSTNAME, STATE, AGE. The database engine will still use your index to look up "Homer" and "Simpson", but once it finds the appropriate index record (exactly the same as how it worked above), that index record already has STATE and AGE. Therefore, the database can get the results of your query straight from the index without having to also read the data page from disk.

A situation where a covering index can drastically improve performance is in the case of table scans. Assume you have 100 fields in your table (so the size of a single row is a few hundred bytes or more). Now a user runs the query

SELECT FIRSTNAME, LASTNAME, AGE FROM PEOPLE

The database would have to read the entire table (including all 100 fields which aren't necessary for this query) to get your results. If you had an index LASTNAME, FIRSTNAME, AGE, then the database could get the results by scanning your entire index instead of scanning the entire table. Since in this case a single index element is far smaller byte-wise than a single data row, the query will be much faster.

In your particular case with so few fields in your table, a covering index probably wouldn't be very useful since the fields in the index would be the same as the fields in your table, thus defeating the whole purpose. However, for a table with dozens of fields, of which only a handful are commonly queried, a covering index can be a great way to speed up your queries.

A really good explanation; much better than the two others currently posted. A quick description of *covering index* is all I can think of that might usefully be added. — Pieter Geerkens, Jul 24 '16 at 14:42
"Databases can also use multiple indexes when running a single query" - that may be correct for some databases or certain types of index, but it's not generally true for all flavours of RDBMS. — APC, Jul 24 '16 at 16:06
I've been searching for days to find this satisfying answer. Thank you. — Yomna Mansour, Mar 15 '22 at 11:37

score 4 · Answer 2 · answered Jul 24 '16 at 18:36

4

Lots of indexes is a 'bad' idea.
Indexes on individual columns won't help much.
One index that is a 'prefix' of another is redundant.
An index on a flag or column of low 'cardinality' (eg gender), won't be used.

Suggestion: Start with one index per column. Then add on a second column to each index. Pick this second column based on what is likely to be tested together. Avoid having both (a,b) and (b,a)

Then watch what types of queries are generated by 'real' users. Tweak the list of indexes accordingly. This info may lead to a few 3-column indexes.

answered Jul 24 '16 at 18:36

Rick James

135,179
13
127
222

Nice concise answer. I had a question regarding a specific scenario. Say I have a table with 15 columns of which 9 fields are used to filter the data. E.g, user can filter on upvotes, author, views, date (the usual) and a bunch of others. In this case where it is hard to predict the order in which the user will select the filters, what indexes should be chosen? Should the server first change the order of fields according to our defined index order before querying the database? – Fanoflix Nov 17 '22 at 06:41
1

@Fanoflix - That much to big a question to answer in Comments. Read my [_Index Cookbook_](http://mysql.rjweb.org/doc.php/index_cookbook_mysql) , then ask a Question with specifics. Then I can give you an answer that is somewhere between concise and thorough. Meanwhile, keep track of what filters the users actually use; it will play a part in my Answer. – Rick James Nov 17 '22 at 16:06
Got it, thanks for the link. I went over it and it seems most of my questions will be answered there. I will surely get back to you once I have the specifics. – Fanoflix Nov 18 '22 at 04:37

score 0 · Answer 3 · answered Jul 24 '16 at 14:03

One index can work for multiple where clauses. So:

(firstname, lastname, age, state)

works for where clauses that have equality conditions for:

firstname
firstname & lastname
firstname & lastname & age
firstname & lastname & age & state

I would suggest that you build a set of indexes for the common cases -- three or four indexes. Add multiple keys to the index, so it can be used for more and more refined searches. Don't bother putting low-cardinality values, such as gender as the first key in the index, because a query using just a filter on gender is probably going to require a full table scan anyway.

If this doesn't meet your needs, you might need to think about other methods for accessing the data, such as full-text indexes.

Age is a horrible attribute to put in any index - as it changes daily for 0.3% of all records making the index obsolete. Birthdate is fine, but any filter on age must be refined into a filter on the comparison of current date to birth date. — Pieter Geerkens, Jul 24 '16 at 14:39

score -1 · Answer 4 · answered Jul 24 '16 at 13:41

I would go with this approach..

Having a key column on index is great for filtering out rows and doing a seek exactly.But with your form,you need many keys as key columns,but having many key columns is not good and it has a limit too..

So i suggest you to identify few columns which are unique or composite index with fields that wont be null,if you dont have unique columns and create a clustered index..

I would create clustered index on birthday,age(just an idea ,you may use other columns as well) and then create a stored procedure with default parameters like below..

create proc usp_getformdata
(
@firstname varchar(200)= null,
@lastname varchar(200)=null,
@age int=null,
@state varchar(20)=null,
@birthday datetime =null,
@gender varchar(10)=null
)
As
Begin
select 
* from
yourtable
where 
firstname=@firstname
and 
lastname=@lastname

--do for all columns
End

Create sql indexes for complex filtering

4 Answers4