I have a question about indexing and a strange (I suppose) behaviour I found when trying out some queries I'm using in a project.
I have the following schema:
a) an `activities` table
+---------+--------------------------------------------------------+------+-----+-------------------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------------------------------------------------+------+-----+-------------------+-------------------+
| id | bigint unsigned | NO | PRI | NULL | auto_increment |
| date | timestamp | NO | UNI | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
| address | varchar(255) | NO | | NULL | |
| title | varchar(255) | NO | | NULL | |
| state | enum('todo','done','interrupted','pending','rejected') | NO | | todo | |
+---------+--------------------------------------------------------+------+-----+-------------------+-------------------+
b) a users
table
+-------------------+---------------+------+-----+-------------------+-------------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+-------------------+-------------------+
| id | varchar(128) | NO | PRI | NULL | |
| email | varchar(1024) | YES | | NULL | |
| name | varchar(1024) | YES | | NULL | |
| surname | varchar(1024) | YES | | NULL | |
| phone_number | varchar(1024) | NO | | NULL | |
| propic_url | varchar(500) | YES | | NULL | +
c) and a worker_to_activity
table to assign workers to a specified activity
+-------------+-----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-----------------+------+-----+---------+-------+
| activity_id | bigint unsigned | NO | PRI | NULL | |
| worker_id | varchar(128) | NO | PRI | NULL | |
+-------------+-----------------+------+-----+---------+-------+
I have a very simple query that searches for all the activities in a certain time range in which a specified user is present, something like this:
select
a.date
a.id,
a.address,
a.title,
a.state
from activities a
where a.date >= "start" and a.date <= "end"
and exists ( select worker_id from worker_to_activity where worker_id = "worker_id" and activity_id = a.id )
limit 11;
And, last but not least, I have an index for this purpose:
create index act_fetch_idx (date, id, address, title, state) on activities;
For testing purposes (seeing if the index works) I deliberately inserted an activity that happens before other activities, but with an higher ID, simulating somebody inserting an activity later than expected (I'm talking about activity #50 in the following view):
+----+---------------------+
| id | date |
+----+---------------------+
| 1 | 2022-09-14 08:00:00 |
| 2 | 2022-09-14 09:00:00 |
| 3 | 2022-09-14 10:00:00 |
| 50 | 2022-09-14 10:55:49 |
| 4 | 2022-09-14 11:00:00 |
| 5 | 2022-09-14 12:00:00 |
| 6 | 2022-09-14 13:00:00 |
| 7 | 2022-09-14 14:00:00 |
| 8 | 2022-09-14 15:00:00 |
| 9 | 2022-09-14 16:00:00 |
| 10 | 2022-09-14 17:00:00 |
| 11 | 2022-09-14 18:00:00 |
| 12 | 2022-09-14 19:00:00 |
| 13 | 2022-09-15 08:00:00 |
| 14 | 2022-09-15 09:00:00 |
| 15 | 2022-09-15 10:00:00 |
| 16 | 2022-09-15 11:00:00 |
| 17 | 2022-09-15 12:00:00 |
| 18 | 2022-09-15 13:00:00 |
| 19 | 2022-09-15 14:00:00 |
| 20 | 2022-09-15 15:00:00 |
In this example, the index works! It's a simple select id, date from activities
Something strange happens when I use date ranges:
a) In this case (using a "daily" range) the index works:
select
a.date
a.id,
a.address,
a.title,
a.state
from activities a
where a.date >= "2022-09-14 00:00" and a.date <= "2022-09-14 23:59"
and exists ( select worker_id from worker_to_activity where worker_id = "worker_id" and activity_id = a.id )
limit 11;
Output:
+----+---------------------+
| id | date |
+----+---------------------+
| 1 | 2022-09-14 08:00:00 |
| 2 | 2022-09-14 09:00:00 |
| 3 | 2022-09-14 10:00:00 |
| 50 | 2022-09-14 10:55:49 |
| 4 | 2022-09-14 11:00:00 |
| 5 | 2022-09-14 12:00:00 |
| 6 | 2022-09-14 13:00:00 |
| 7 | 2022-09-14 14:00:00 |
| 8 | 2022-09-14 15:00:00 |
| 9 | 2022-09-14 16:00:00 |
| 10 | 2022-09-14 17:00:00 |
+----+---------------------+
b) When I use a wider range, i.e the activities in the next two days, the index "stops working (?)":
select
a.date
a.id,
a.address,
a.title,
a.state
from activities a
where a.date >= "2022-09-14 00:00" and a.date <= "2022-09-15 23:59"
and exists ( select worker_id from worker_to_activity where worker_id = "worker_id" and activity_id = a.id )
limit 11;
Output:
+----+---------------------+
| id | date |
+----+---------------------+
| 1 | 2022-09-14 08:00:00 |
| 2 | 2022-09-14 09:00:00 |
| 3 | 2022-09-14 10:00:00 |
| 4 | 2022-09-14 11:00:00 |
| 5 | 2022-09-14 12:00:00 |
| 6 | 2022-09-14 13:00:00 |
| 7 | 2022-09-14 14:00:00 |
| 8 | 2022-09-14 15:00:00 |
| 9 | 2022-09-14 16:00:00 |
| 10 | 2022-09-14 17:00:00 |
| 11 | 2022-09-14 18:00:00 |
+----+---------------------+
The EXPLAIN
of both queries is slightly different, but I don't understand what causes this behaviour:
a) EXPLAIN
a:
+----+-------------+--------------------+------------+--------+----------------------------+---------------+---------+-----------------------+------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------------+------------+--------+----------------------------+---------------+---------+-----------------------+------+----------+--------------------------+
| 1 | SIMPLE | a | NULL | range | PRIMARY,date,act_fetch_idx | act_fetch_idx | 4 | NULL | 13 | 100.00 | Using where; Using index |
| 1 | SIMPLE | worker_to_activity | NULL | eq_ref | PRIMARY,worker_id | PRIMARY | 522 | db.a.id,const | 1 | 100.00 | Using index |
+----+-------------+--------------------+------------+--------+----------------------------+---------------+---------+-----------------------+------+----------+--------------------------+
b) EXPLAIN
b:
+----+-------------+--------------------+------------+--------+---------------------------------------------+-----------+---------+-------------------------------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------------+------------+--------+---------------------------------------------+-----------+---------+-------------------------------------------+------+----------+-------------+
| 1 | SIMPLE | worker_to_activity | NULL | ref | PRIMARY,worker_id | worker_id | 514 | const | 23 | 100.00 | Using index |
| 1 | SIMPLE | a | NULL | eq_ref | PRIMARY,date,act_fetch_idx,count_activities | PRIMARY | 8 | db.worker_to_activity.activity_id | 1 | 53.49 | Using where |
+----+-------------+--------------------+------------+--------+---------------------------------------------+-----------+---------+-------------------------------------------+------+----------+-------------+
The naive solution is using ORDER BY
(which produces the same result as the first scenario both in execution plan and results), but my question was more about why this happens: shouldn't the index sort the results in both cases? I searched on the web but I couldn't find much.
Thanks in advance for your time.