2

I have a table that uses "start dates" or effective dates. The values in the table take effect from the start date onward, until it is overridden by another entry from the same table with a later start date.

My schema and sample data:

CREATE TABLE VALUE_DATA (
    `start_date` DATE,
    `value` FLOAT
);

INSERT INTO VALUE_DATA (start_date, value) VALUES
('2015-01-01', 10),
('2015-01-03', 20),
('2015-01-08', 30),
('2015-01-09', 15);

Query that produces required results:

SELECT date, value
FROM(
    SELECT date, MAX(start_date) as max_start
    FROM (
        select curdate() - INTERVAL (ones.digit + (10 * tens.digit) + (100 * hundreds.digit)) DAY as date
        from (select 0 as digit union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as ones
        cross join (select 0 as digit union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as tens
        cross join (select 0 as digit union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as hundreds
    ) DATE_TABLE
    LEFT JOIN VALUE_DATA ON (DATE_TABLE.date >= VALUE_DATA.start_date)
    WHERE DATE_TABLE.date between '2015-01-01' and '2015-01-10'
    GROUP BY date
) START_DATES LEFT JOIN VALUE_DATA ON (START_DATES.max_start = VALUE_DATA.start_date);

I've created an SQL Fiddle to simulate the problem.

Although the SQL Fiddle works (gives correct results), I'm not convinced that it's the best way to do this. The query I had to use is a bit convoluted. I'd ultimately like to create a view for this table that contains the correct value for every day regardless of whether it falls on a start date or not (like the output produced by the Fiddle) to make it easier to join to this table. Obviously, I want to make sure this view is as fast as possible. So my question is, how can this query be improved (optimized) for use in such a view?

Bulat
  • 6,869
  • 1
  • 29
  • 52
mkasberg
  • 16,022
  • 3
  • 42
  • 46
  • 3
    If you really need this, then add a calendar table to your database. – Gordon Linoff Aug 11 '15 at 14:40
  • It seems that the only way you can improve your query is by taking `DATE_TABLE` logic else where. You can create a table that will have all dates required by your domain (date dimension). It will simplify the query – Bulat Aug 11 '15 at 14:44
  • 1
    What do you do with the result set? – Strawberry Aug 11 '15 at 14:58
  • I would be using the result set to build a view, which would be JOINed on date into other queries. Something like `CREATE VIEW ... AS SELECT date, value ...`. I would need this view to be optimized to join on the date field. – mkasberg Aug 11 '15 at 15:28

2 Answers2

2

You need to be very careful with this type of view. It will be easy to write a view that is good at giving all the individual dates that each record is valid for, but slow when asking which record is valid on one specific date.

(Because to answer the second question involves answering the first question for Each and Every date first, then discarding the failures.)

The following is reasonable at taking a date and returning the rows valid on that date.

CREATE VIEW DAILY_VALUE_DATA AS (
    SELECT
        DATE_TABLE.date,
        VALUE_TABLE.value
    FROM
        DATE_TABLE
    LEFT JOIN
        VALUE_DATA
            ON  VALUE_DATA.start_date = (SELECT MAX(lookup.start_date)
                                           FROM VALUE_DATA lookup
                                          WHERE lookup.start_date <= DATE_TABLE.date
                                        )
);

SELECT * FROM DAILY_VALUE_DATA WHERE date = '2015-08-11'

Note: This assumes DateTable is a real persistent materialised table, not the in-line view you used, use of which will greatly compromise performance.

It also assumes that VALUE_DATA is indexed by the start_date.


EDIT:

I also find it likely that your value table will likely have other columns. Let's say that it is a value per person. Maybe their address on any given date.

To extend the query above you then also need to join on the person table...

CREATE VIEW DAILY_VALUE_DATA AS (
    SELECT
        PERSON.id   AS person_id,
        DATE_TABLE.date,
        VALUE_TABLE.value
    FROM
        PERSON
    INNER JOIN
        DATE_TABLE
            ON  DATE_TABLE.date >=          PERSON.date_of_birth
            AND DATE_TABLE.date <  COALESCE(PERSON.date_of_death, CURDATE() + 1)
    LEFT JOIN
        VALUE_DATA
            ON  VALUE_DATA.start_date = (SELECT MAX(lookup.start_date)
                                           FROM VALUE_DATA lookup
                                          WHERE lookup.start_date <= DATE_TABLE.date
                                            AND lookup.person_id   = PERSON.id
                                        )
);

SELECT * FROM DAILY_VALUE_DATA WHERE person_id = 1 AND date = '2015-08-11'


EDIT:

Another alternative to the LEFT JOIN is to embed the correllated sub-query in the SELECT block. This is effective when you only have one value to pull from the target table, but less effective if you need to pull many values from the target table...

CREATE VIEW DAILY_VALUE_DATA AS (
    SELECT
        PERSON.id   AS person_id,
        DATE_TABLE.date,
        (SELECT VALUE_DATA.value
           FROM VALUE_DATA
          WHERE VALUE_DATA.start_date <= DATE_TABLE.date
            AND VALUE_DATA.person_id   = PERSON.id
       ORDER BY VALUE_DATA.start_date DESC
          LIMIT 1
        )   AS value
    FROM
        PERSON
    INNER JOIN
        DATE_TABLE
            ON  DATE_TABLE.date >=          PERSON.date_of_birth
            AND DATE_TABLE.date <  COALESCE(PERSON.date_of_death, CURDATE() + 1)
);

SELECT * FROM DAILY_VALUE_DATA WHERE person_id = 1 AND date = '2015-08-11'
MatBailie
  • 83,401
  • 18
  • 103
  • 137
  • 1
    Final note; This view encapsulate the logic you don't want to have to deal with elsewhere. But that in it self incurs an overhead. There is no way to make such a view as effective as embedding the join conditions in your own query. At the very least because it involves a join on the calendar table that would not otherwise be required. You appear to have choices; your SQL developers need to learn how to write these correlated sub-query joins, or you need a different data model, or you need an ETL process to create a persisted daily fact table, or you need to accept the performance hit. – MatBailie Aug 11 '15 at 15:46
1

I would approach this in two steps.

First you need to bring the end of the period to each record, which will turn your rows from events to periods:

SELECT 
  v1.start_date, 
  v2.start_date as next_start_date, 
  v1.value 
FROM 
  VALUE_DATA v1 LEFT JOIN 
  VALUE_DATA v2 ON 
    v1.start_date < v2.start_date AND 
    NOT EXISTS 
     (SELECT * FROM VALUE_DATA 
      WHERE start_date > v1.start_date and start_date < v2.start_date)

Now you need to add date dimension to your schema.

Once you have date dimension, it is easy to join it to previous query:

SELECT 
  d.date, v1.value 
FROM 
  VALUE_DATA v1 LEFT JOIN 
  VALUE_DATA v2 ON 
    v1.start_date < v2.start_date AND 
    NOT EXISTS 
     (SELECT * FROM VALUE_DATA 
      WHERE start_date > v1.start_date and start_date < v2.start_date)
  INNER JOIN DATE_DIMENSION d ON 
      d.date >= v1.start_date AND d.date < COALESCE(v2.start_date, CURDATE()) 

It is possible that the following query will be faster in MySQL, it sort of derives from the first two, just instead of using JOIN you find next start_date within a subquery:

SELECT 
  d.date, v1.value 
FROM 
  VALUE_DATA v1 
  INNER JOIN DATE_DIMENSION d ON 
      d.date >= v1.start_date AND 
      d.date < (SELECT COALESCE(MIN(v.start_date), CURDATE()) 
                FROM VALUE_DATA v 
                WHERE v.start_date > v1.start_date);
Bulat
  • 6,869
  • 1
  • 29
  • 52
  • 1
    This will have very poor performance when querying the view for a specific date. It's good in the other direction *(starting with a record, which dates is this record valid on?)* but there are many non-SARGable statements when asking starting with a date and determining which records are valid for that date. – MatBailie Aug 11 '15 at 15:01
  • I have added option based on sub-query with an aggregate for finding next value instead of `NON EXISTS`. – Bulat Aug 11 '15 at 16:58