0

I know there are several questions similar to this one, but those I've found do not relate directly to my problem.

Some initial context: I have a facts table, called ft_booking, with around 10MM records. I have a dimension called dm_date, with around 11k records, which are dates. These tables are related through foreign keys, as usual. There are 3 date foreign keys in the table ft_booking, one for boarding, one for booking, and other for cancellation. All columns have the very same definition, and the amount of distinct records for each is similar (ranging from 2.5k to 3k distinct values in each column).

There I go:

EXPLAIN SELECT
*
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_booking
WHERE date (db.date) = '2018-05-05'

enter image description here

As you can see, index is being used in the table booking, and the query runs really fast, even though, in my filter, I'm using the date() function. For brevity, I'll just state that the same happens using the column fk_date_boarding. But, check this out:

EXPLAIN SELECT
*
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_cancellation
WHERE date (db.date) = '2018-05-05';

enter image description here

For some mysterious reason, the planner chooses not to use the index. Now, I understand that using some function over a column kind of forces the database to perform a full table scan, in order to be able to apply that function over the column, thus bypassing the index. But, in this case, the function is not over the actual foreign key column, which is where the lookup in the booking table should be ocurring.

If I remove the date() function, the index will be used in any of those columns, as expected. One might say, then, "well, why don't you just get rid of the date() function?" - I use metabase, an Interface which allow users to use a graphical interface in order to build queries without knowing MySQL, and one of the current limitations of that tool is that it always uses the date() function when building queries not directly written in MySQL - hence, I have no way to remove the function in the queries I'm running.

Actual question: why does MySQL use index in the first two cases, but doesn't in the latter, considering the amount of distinct values is pretty much the same for all columns and they have the exact smae definition, apart from the name? Am I missing something here?

EDIT: Here is the CREATE statment of each table involved. There are some more, but we just need here tables ft_booking and dm_date (first two tables of the file).

Lucas Lima
  • 832
  • 11
  • 23

1 Answers1

1

You are "hiding date in a function call". If db.date is declared a DATE, then

    date (db.date) = '2018-05-05'

can be simply

    db.date = '2018-05-05'

If db.date is declared a DATETIME, then change to

        db.date >= '2018-05-05'
    AND db.date  < '2018-05-05' + INTERVAL 1 DAY

In either case, be sure there is an index on db.date.

If by "I have a dimension called dm_date", you mean you built a dimension table to hold just dates, and then you are JOINing to the main table with some id, ... To put it bluntly, don't do that! Do not normalize "continuous" things such as DATE, DATETIME, FLOAT, or other numeric values.

If you need to discuss this further, please provide SHOW CREATE TABLE for the relevant table(s). (And please use text, not screen shots.)

Why??

The simple answer is that the Optimizer does not know how to unravel any function. Perhaps it could; perhaps it should. But it does not. Perhaps the answer involves not wanting to see how the function result will be used... comparing against a DATE? against a DATETIME? being used as a string? other?

Still, I suggest the real performance killer is the existence of dm_date rather than indexing and using the date in the main table.

Furthermore, the main table is bigger than it needs to be! fk_date_booking is a 4-byte INT SIGNED instead of a 3-byte DATE.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • I know I can ommit the date() function. But, as I explained, with my interface, I can't get rid of that function. So the actual question is why would MySQL use the index in one case, and no the other. As I said in the original post, it is a date function, and I know how to write the query you suggested. That is not my doubt at all. As it can be seen, as well, there is already an index in the column, i thought I had made that clear. Sorry. About the dimension: I did that. Why not to do it? About the screen shots: sorry. Won't happen again. – Lucas Lima Jun 10 '18 at 04:58
  • We have a table called dm_date because we don't store in it only the dates, but, rather, we classify dates (by quarter, weekday, holiday, and so on). I understand the aspect of "the Optimizer does not know how to unravel any function". But, see, I'm ordering a full table scan against dm_date, not agaisnt ft_booking. What kills me is it uses index that in the fiirst 2 cases. I can't tell why it does in those cases; hence, I was trying to figure out the difference between the first 2 and the latter. And, in my experience, dm_date has always nearly the same performance as the raw date column. – Lucas Lima Jun 11 '18 at 14:20
  • @LucasLima - The distribution of "bookings" is probably very regular. The distribution for "cancellations" is probably very lopsided -- mostly `NULL`. The Optimizer assumes the former, and will mess up with the latter. I'm not sure if this explains what happened. In any case, it is one of _many_ reasons why normalizing dates leads to problems. Ouch -- Are you storing "times" separately from "dates"? That is another performance and coding disaster waiting to bite you. – Rick James Jun 11 '18 at 15:44
  • I thought about that NULL thing, but most of the non cancelled records have fk set to 0 - which means "null date" in dm_date. But, yes, the column is not null, maybe that could interfere. Haven't thought about that. About times and dates: er... I am storing them separately. Do you mean the problem is storing them in different tables, or even in diferent columns? (If the latter, had I merged it all in a column, I wouldn't be able to use an index to filter my data by time) – Lucas Lima Jun 11 '18 at 17:43
  • @LucasLima - `0` vs `NULL` -- lots of either would have the same effect -- to confuse the Optimizer as to the "cardinality" of that column. Time and Date -- two columns usually causes problems. Give an example of how you need to "filter by time". – Rick James Jun 12 '18 at 01:50
  • If I want to see any behaviour of users who buy their tickets after 6PM, ie night time, or between noon and 2 PM, ie lunch time, behavior of those who travel before dawn, etc. Time is really important to us, since it is a data warehouse. Didn't know `0`s could mess with the cardinality as much as `NULL` values. Even so, then, the usage of the date function does not impede the usage of the index, as shown by the first two; it is rather a cardinality issue, then, right? – Lucas Lima Jun 12 '18 at 18:03
  • I think the cardinality test treats `NULL` as a distinct value, and any other frequent value (such as 0) likewise. A column with only Yes and No would have cardinality=2 even if 90% yes and 10% no. This leads to a problem: An index is usable for No, but not for Yes. (It is more complex than that.) – Rick James Jun 12 '18 at 19:10