0

My first attempt at a question proved to be confusing and I received some mixed answers (probably due to my confusing question). Here's a different and better question...

Assume my table looks like this in MySQL:

CREATE TABLE `people` (
    `person_id` INT(11),
    `alias_num` TINYINT(3),
    `first_name` VARCHAR(255) NOT NULL,
    `last_name` VARCHAR(255) NOT NULL,
    PRIMARY KEY (`person_id`,`alias_num`)
  )
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

With data like this:

person_id alias_num first_name last_name
--------- --------- ---------- ---------
1         1         John       Smith
2         1         Joe        Smith
3         1         Bill       Smith     # <-- Notice this guy has 3 aliases
3         2         Billy      Smith     # <--
3         3         William    Smith     # <--
4         1         Susan      Thompson
...

Assume jo and smi were entered into an HTML search form (with both fields required) and my query will always be like this:

SELECT person_id FROM people WHERE first_name LIKE 'jo%' AND last_name LIKE 'smi%';

Question: What's the best index(es) to add to my table to make the above query the fastest?

Note: I did some quick testing on a table of almost a million rows and it looks like that 2 seperate indexes of first_name(15) and last_name(15) seem faster than a composite index of last_name(15),first_name(15) using SQL_NO_CACHE? But maybe I'm testing this wrong. I'm also considering that maybe a combination of both composite index and an index on a single name would be good (if that doesn't confuse the optimizer)?

Bonus Question:
Considering that I'm searching for partial words, not full words, would something like ElasticSearch do this query any better?

Community
  • 1
  • 1
prograhammer
  • 20,132
  • 13
  • 91
  • 118

4 Answers4

1

You're correct, separate first_name and last_name indexes are going to work better.

In my experience, a composite index is best on non-variable fields (like 2 numbers). I'd go with one index on each name field.

You might also adjust your my.cnf settings if you have not already, tweaking memory available to MySQL can make dramatic differences in sorting/searching on indexes.

As for my.cnf, that's a whole other question, IMO. You can start here: https://dev.mysql.com/doc/refman/5.6/en/server-default-configuration-file.html. Mysql ships with my-large.cnf, my-huge.cnf so those should give you a good start.

mikeb
  • 10,578
  • 7
  • 62
  • 120
  • +1. Awesome Mike! But let me let this question sit for just a little bit more of the day before I accept it. And for my.cnf settings, are you mainly just referring to `innodb_buffer_pool`? – prograhammer Nov 30 '15 at 16:07
  • One more thing Mike, if the form fields are both required (first_name and last_name) is there any point to having both indexed? I should just go with last_name index right? Since the optimizer probably isn't going to do an index merge, and who knows if index merge is best? – prograhammer Nov 30 '15 at 16:49
  • 2
    I would index both, because the index cardinality is what is going to determine which index should be used. Having both will let mysql determine which one it thinks is best, because it will depend, to an extent, on the data in the table (and therefore the index). It also has to do with how up-to-date the index statistics are, but the short answer is create both indexes. – mikeb Nov 30 '15 at 17:12
  • Yeah I thought so. I just tend to assume last_name would have the greater cardinality, but that's just my own guess. Better to let MySQL be the judge (as long as the data fits into RAM using both indexes). Is there anything open-source out there that is going to do this type of query better? Ie. ElasticSearch? – prograhammer Nov 30 '15 at 17:17
  • 1
    Well, how fast is it running on YOUR data, and how much faster do you need it to be? There are probably all kinds of ways to speed it up, but now you're getting caught up in *Premature Optimization* http://c2.com/cgi/wiki?PrematureOptimization unless you have a good reason to keep tweaking this. – mikeb Nov 30 '15 at 17:38
  • Good point. Also, see my additional answer below. Thank you so much! :-) – prograhammer Nov 30 '15 at 18:04
1

Adding to the answers above from @mikeb and @RickJames,

The MySQL docs say here:

For a BTREE index, an interval might be usable for conditions combined with AND, where each condition compares a key part with a constant value using =, <=>, IS NULL, >, <, >=, <=, !=, <>, BETWEEN, or LIKE 'pattern' (where 'pattern' does not start with a wildcard). An interval can be used as long as it is possible to determine a single key tuple containing all rows that match the condition (or two intervals if <> or != is used).

The optimizer attempts to use additional key parts to determine the interval as long as the comparison operator is =, <=>, or IS NULL. If the operator is >, <, >=, <=, !=, <>, BETWEEN, or LIKE, the optimizer uses it but considers no more key parts. For the following expression, the optimizer uses = from the first comparison. It also uses >= from the second comparison but considers no further key parts and does not use the third comparison for interval construction:

key_part1 = 'foo' AND key_part2 >= 10 AND key_part3 > 10

The single interval is:

('foo',10,-inf) < (key_part1,key_part2,key_part3) < ('foo',+inf,+inf)

It is possible that the created interval contains more rows than the initial condition. For example, the preceding interval includes the value ('foo', 11, 0), which does not satisfy the original condition.

When using LIKE on a key part of a composite, the key parts to the right are not used. So this confirms what @mikeb says in that two single indexes would work better because MySQL can judge which one has better cardinality and use it. However, I ended up using the answer from Rick James with last_name,first_name,person_id(prefix/size removed) since I was only selecting person_id. This acts as a covering index and works just as fast (possibly faster) in my tests than single separate indexes, plus gives me good sorting by last_name then first_name. Composite keys are usually the better way to go anyways.

prograhammer
  • 20,132
  • 13
  • 91
  • 118
1
SELECT person_id FROM people WHERE first_name LIKE 'jo%' AND last_name LIKE 'smi%';

Case 1 - Covering (rare): All the fields of the entire SELECT are included in the index. Either of these is "covering" and optimal:

INDEX(first_name, last_name, person_id)
INDEX(last_name, first_name, person_id)

"Covering" implies that it does all the work inside the index and does not need to touch the data. Note: The "Data" and the PRIMARY KEY live together in one BTree; each secondary index lives in another BTree.

Case 2 - non-Covering: If you don't want to, or can't (because of TEXT, etc) include all the fields, then either of these is optimal:

INDEX(first_name)
INDEX(last_name)

Create both indexes and let the optimizer pick the better one dynamically. INDEX(first_name, last_name) is of no use because of the wild card; it will not get past the first column of the index.

Prefixing: Do not use first_name(15). It will not save much space, and it will not help in performance. Just as with Case 2, it will not get past the first column in a composite index.

(255): Don't indiscriminately use VARCHAR(255). The 255 gets involved in the details of the temp table that might be used to perform the SELECT, and you will slow down the query over what would happen with a sensible max length. In some cases you will exceed a limit and not be allowed to build the index.

Secondary keys: In InnoDB, each "secondary key" implicitly includes all the columns from the PRIMARY KEY. So INDEX(first_name, last_name) will actually include person_id (and alias_num), thereby making equivalent to my recommended INDEX(first_name, last_name, person_id).

INDEX(a) and INDEX(a,b): The former is virtually always redundant; keep only the latter.

my.cnf: The most important setting for this discussion is to set innodb_buffer_pool_size to about 70% of available RAM.

Further discussion: Building an index from a SELECT and Compound indexes.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Great! Looks like `INDEX(last_name, first_name, person_id)` is the way to go (no prefixing and also I reduced the varchar(255) down to varchar(70) for the name fields). – prograhammer Nov 30 '15 at 21:01
0

It seems to use the key ?!?

DROP TABLE IF EXISTS my_table;

CREATE TABLE my_table
(id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
,first_name VARCHAR(12) NOT NULL
,last_name VARCHAR(12) NOT NULL
,INDEX fl (first_name,last_name)
);

INSERT INTO my_table (first_name,last_name) VALUES
('John','Brown'),
('John','Smith'),
('John','Johnson'),
('John','Lewis'),
('John','Lennon'),
('John','Major'),
('James','Brown'),
('James','McIlroy'),
('James','Napier'),
('Jamie','Oliver'),
('James','May'),
('James','Martin');

SELECT * FROM my_table WHERE first_name LIKE 'Ja%' AND last_name LIKE 'Bro%';
+----+------------+-----------+
| id | first_name | last_name |
+----+------------+-----------+
|  7 | James      | Brown     |
+----+------------+-----------+

EXPLAIN 
SELECT * FROM my_table WHERE first_name LIKE 'Ja%' AND last_name LIKE 'Bro%';
+----+-------------+----------+-------+---------------+------+---------+------+------+----------+--------------------------+
| id | select_type | table    | type  | possible_keys | key  | key_len | ref  | rows | filtered | Extra                    |
+----+-------------+----------+-------+---------------+------+---------+------+------+----------+--------------------------+
|  1 | SIMPLE      | my_table | range | fl            | fl   | 28      | NULL |    6 |   100.00 | Using where; Using index |
+----+-------------+----------+-------+---------------+------+---------+------+------+----------+--------------------------+
Strawberry
  • 33,750
  • 13
  • 40
  • 57
  • It's a partial index lookup. It's using the `first_name` part. You see you get examined rows = 6 (all rows with "ja" prefix for first_name). It's not using the full composite because it's not using the `last_name` part. – prograhammer Nov 30 '15 at 16:22
  • @prograhammer I see - and we know that it doesn't use the whole index because it says 'using where'? – Strawberry Nov 30 '15 at 16:49
  • 1
    That's a good question. I'm not yet a master of EXPLAIN. But I think that 'using where' doesn't help you know if the composite index is being used fully. It does tell you that a range is being used on the index f1 with 6 examined rows. But honestly I'm not even sure the examined rows is the best way to know, because that isn't always accurate. But I just found this http://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#range-access-multi-part – prograhammer Nov 30 '15 at 17:47
  • *"The optimizer attempts to use additional key parts to determine the interval as long as the comparison operator is =, <=>, or IS NULL. If the operator is >, <, >=, <=, !=, <>, BETWEEN, or LIKE, the optimizer uses it but considers no more key parts."* – prograhammer Nov 30 '15 at 17:50