9

I need to store simple data - suppose I have some products with codes as a primary key, some properties and validity ranges. So data could look like this:

Products
code    value   begin_date  end_date
10905   13      2005-01-01  2016-12-31
10905   11      2017-01-01  null

Those ranges are not overlapping, so on every date I have a list of unique products and their properties. So to ease the use of it I've created the function:

create function dbo.f_Products
(
    @date date
)
returns table
as
return (
    select
    from dbo.Products as p
    where
        @date >= p.begin_date and
        @date <= p.end_date
)

This is how I'm going to use it:

select
    *
from <some table with product codes> as t
    left join dbo.f_Products(@date) as p on
        p.code = t.product_code

This is all fine, but how I can let optimizer know that those rows are unique to have better execution plan?

I did some googling, and found a couple of really nice articles for DDL which prevents storing overlapping ranges in the table:

But even if I try those constraint I see that optimizer cannot understand that resulting recordset will return unique codes.

What I'd like to have is certain approach which gives me basically the same performance as if I stored those products list on certain date and selected it with date = @date.

I know that some RDMBS (like PostgreSQL) have special data types for this (Range Types). But SQL Server doesn't have anything like this.

Am I missing something or there're no way to do this properly in SQL Server?

Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
  • 6
    If you care about performance, don't use a UDF. Just join to the table directly. – Gordon Linoff Nov 10 '16 at 16:48
  • Don't know about sql server, but tree sets are my usual go to when storing information in non-overlapping intervals. – code11 Nov 10 '16 at 16:48
  • OT: your `@date <= p.end_date` together with `end_date`s like `2016-01-01` might get you into troubles if you ever have to deal with time portions... – Shnugo Nov 10 '16 at 16:51
  • 3
    @GordonLinoff I disagree with this, but this is not the case here, so let's not go into heated discussion – Roman Pekar Nov 10 '16 at 16:51
  • @Shnugo yes, I should've mentioned that dates are `date` type, not `datetime`. However, this is not the point here. – Roman Pekar Nov 10 '16 at 16:52
  • Which version of SQL-Server? If you can use `LAG` and `LEAD` an index on `begin_date` and **no `end_date`** would be the fastest I can think of... Use `TOP 1` together with `WHERE` and `ORDER BY begin_date DESC`... – Shnugo Nov 10 '16 at 16:54
  • it's 2008 R2, thanks, forgot to mention. So `LAG` and `LEAD` are not the options, sadly – Roman Pekar Nov 10 '16 at 16:56
  • Do you expect gaps? – Shnugo Nov 10 '16 at 16:57
  • If you have a good solution for a case without gaps I can use this and create dummy records for gaps, so doesn't really matter – Roman Pekar Nov 10 '16 at 17:00
  • 1
    Just a random thought: INDEX on `begin_date` with an include of `end_date`, together with a `SELECT TOP 1 ...` in the UDF? Does that produce a better execution plan? – TT. Nov 10 '16 at 17:03
  • I was told that a function call can never be inlined for optimization. Is there a good reason for not using BETWEEN? – WGroleau Nov 10 '16 at 17:17
  • @WGroleau 1) Why not `BETWEEN`: [Read this](https://sqlblog.org/2009/10/16/bad-habits-to-kick-mis-handling-date-range-queries). 2) There are *normal* table valued functions, which are really bad (multi-statement with `BEGIN` and `END`), and inlineable TVF, which are - tataa! - inlined (real *ad-hoc*, single-statement). They can be seen as parameterized VIEWs. 3) And - might be - that you were actually referring to [sargability](http://www.sqlservercentral.com/Forums/Topic1541117-1292-1.aspx) – Shnugo Nov 10 '16 at 17:35
  • 2
    "Those ranges are not overlapping" -- except they *are*. Your second row should presumably start `2017-01-01`. – Jeroen Mostert Feb 13 '18 at 13:53
  • @JeroenMostert stupid me, thanks! – Roman Pekar Feb 13 '18 at 13:58
  • 2
    I think you're barking up the wrong tree with your desire for somehow telling the optimizer your result rows are unique. You should be much more concerned about getting it to efficiently retrieve the rows you're interested in in the first place. (No, SQL Server still has no separate support for ranges as of 2018.) Without the "knowledge" of uniqueness, the optimizer will decide the join type based on the cardinality of the other table you're joining with, which should be fine. `CREATE UNIQUE CLUSTERED INDEX IX_Products ON products([code], [begin_date], [end_date])` should be all you need... – Jeroen Mostert Feb 13 '18 at 14:59
  • ...and where it's not, materializing results in a temp table or applying an explicit `JOIN` hint should get the query up to speed. Either should be more effective than wanting a uniqueness override/flag/toggle somewhere. If you really think otherwise, you can [suggest a feature enhancement](https://feedback.azure.com/forums/908035-sql-server). But I've done some testing with various TVFs, and as far as I can tell SQL Server will "do the right thing" when it inlines the TVF in the query, with not much room for improvement by assuming one side has fewer rows. – Jeroen Mostert Feb 13 '18 at 15:01
  • Well, usually it performs well enough. But it's nice to have joins-elimination for nested functions, so optimizer can build better plan for queries where some columns are not used. – Roman Pekar Feb 13 '18 at 15:07
  • 1
    Your current query will not cater for cases where your \@date is > than the last begin_date. I would create a clustered primary key on code, begin_date and just query using WHERE code = 10905 AND begin_date <= \@Date AND \@date <= COALESCE(end_date, \@date). This uses an index seek. – Steve Ford Feb 14 '18 at 10:07
  • Why don't you declare a variable of type int (I assume your id is of type int) and get only that with your function, or even put the function in the WHERE clause, then eliminate the join altogether since you only use values from T and not from P? You did state that the ranges do not overlap, so you should have only 1 id as a result of your function. Surely you will have appropriate indexes to make the 2 individual queries extremely fast. – user7396598 Feb 16 '18 at 00:07
  • Why can't you simply use an INNER or OUTER join? I deal with similar data structure\*, use trivial join queries and never ran into performance issues. (\* Actually I have to search ranges for ranges i.e. [@Date1, @Date2] overlap [begin, end] dates). – Salman A Feb 16 '18 at 22:45
  • Could you be a bit more specific about the `Performance Issue` you are having? You have explained the scenario very well and vaguely defined the problem that you may or may not have, basically could you add some problematic execution plan that you would like to improve?? – M.Ali Feb 18 '18 at 09:40
  • Have a look at [similar question](https://stackoverflow.com/q/47798532/87015) where user wants to optimize `WHERE @variable BETWEEN two columns`. It is tagged MySQL but many concepts apply here. – Salman A Feb 18 '18 at 15:47
  • If you get rid of this table-valued function, you will get 10x performance right there. It is pretty much the last thing you want to use in SQL Server – Ruslan Tolkachev Feb 20 '18 at 01:01
  • 1
    This is a typical use case of interval query. It can be efficiently solved by using the RI-Tree model. Please see http://www.itprotoday.com/software-development/interval-queries-sql-server by Itzik Ben-Gan . – Jesús López Feb 20 '18 at 13:22
  • @JesúsLópez I knew there're should be some great work by Itzik Ben-Gan, going to read it now, thanks! – Roman Pekar Feb 20 '18 at 13:38

5 Answers5

1

EDIT: My original answer was using an INNER JOIN, but the questioner wanted a LEFT JOIN.

CREATE TABLE Products
  (
  [Code] INT NOT NULL
  , [Value] VARCHAR(30) NOT NULL
  , Begin_Date DATETIME NOT NULL
  , End_Date DATETIME NULL
  )

/*
Products
code    value   begin_date  end_date
10905   13      2005-01-01  2016-12-31
10905   11      2017-01-01  null
*/
INSERT INTO Products ([Code], [Value], Begin_Date, End_Date) VALUES (10905, 13, '2005-01-01', '2016-12-31')
INSERT INTO Products ([Code], [Value], Begin_Date, End_Date) VALUES (10905, 11, '2017-01-01', NULL)

CREATE NONCLUSTERED INDEX SK_ProductDate ON Products ([Code], Begin_Date, End_Date) INCLUDE ([Value])

CREATE TABLE SomeTableWithProductCodes
  (
  [CODE] INT NOT NULL 
  )

 INSERT INTO SomeTableWithProductCodes ([Code]) VALUES (10905)

Here is a prototypical query, with a date predicate. Note that there are more optimal ways to do this in a bulletproof fashion, using a "less than" operator on the upper bound, but that's a different discussion.

SELECT
  P.[Code]
  , P.[Value]
  , P.[Begin_Date]
  , P.[End_Date]
FROM
   SomeTableWithProductCodes ST
   LEFT JOIN Products AS P ON
     ST.[Code] = P.[Code]
     AND '2016-06-30' BETWEEN P.[Begin_Date] AND ISNULL(P.[End_Date], '9999-12-31')

This query will perform an Index Seek on the Product table.

Here is a SQL Fiddle: SQL Fiddle - Products and Dates

Query Execution Plan

Pittsburgh DBA
  • 6,672
  • 2
  • 39
  • 68
1

You can create an indexed view that contains a row for each code/date in the range.

ProductDate (indexed view)
code    value   date
10905   13      2005-01-01
10905   13      2005-01-02
10905   13      ...
10905   13      2016-12-31
10905   11      2017-01-01
10905   11      2017-01-02
10905   11      ...
10905   11      Today

Like this:

create schema digits
go

create table digits.Ones (digit tinyint not null primary key)
insert into digits.Ones (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)

create table digits.Tens (digit tinyint not null primary key)
insert into digits.Tens (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)

create table digits.Hundreds (digit tinyint not null primary key)
insert into digits.Hundreds (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)

create table digits.Thousands (digit tinyint not null primary key)
insert into digits.Thousands (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)

create table digits.TenThousands (digit tinyint not null primary key)
insert into digits.TenThousands (digit) values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)
go

create schema info
go

create table info.Products (code int not null, [value] int not null, begin_date date not null, end_date date null, primary key (code, begin_date))
insert into info.Products (code, [value], begin_date, end_date) values 
(10905, 13, '2005-01-01', '2016-12-31'),
(10905, 11, '2017-01-01', null)

create table info.DateRange ([begin] date not null, [end] date not null, [singleton] bit not null default(1) check ([singleton] = 1))
insert into info.DateRange ([begin], [end]) values ((select min(begin_date) from info.Products), getdate())
go

create view info.ProductDate with schemabinding 
as
select
    p.code,
    p.value,
    dateadd(day, ones.digit + tens.digit*10 + huns.digit*100 + thos.digit*1000 + tthos.digit*10000, dr.[begin]) as [date]
from
    info.DateRange as dr
cross join
    digits.Ones as ones
cross join
    digits.Tens as tens
cross join
    digits.Hundreds as huns
cross join
    digits.Thousands as thos
cross join
    digits.TenThousands as tthos
join
    info.Products as p on
    dateadd(day, ones.digit + tens.digit*10 + huns.digit*100 + thos.digit*1000 + tthos.digit*10000, dr.[begin]) between p.begin_date and isnull(p.end_date, datefromparts(9999, 12, 31))
go

create unique clustered index idx_ProductDate on info.ProductDate ([date], code)
go

select *
from info.ProductDate with (noexpand)
where 
    date = '2014-01-01'

drop view info.ProductDate
drop table info.Products
drop table info.DateRange
drop table digits.Ones
drop table digits.Tens
drop table digits.Hundreds
drop table digits.Thousands
drop table digits.TenThousands
drop schema digits
drop schema info
go
Aducci
  • 26,101
  • 8
  • 63
  • 67
  • That one is good, i was thinking about this myself, the problem is that for long periods it will take a lot of space – Roman Pekar Feb 17 '18 at 09:31
  • @RomanPekar What are you trying to accomplish, exactly? Do you want all product and date compbinations, with no gaps, or do you simply want to efficiently query the table you already have, which is Code, Value, Begin_Date, and End_Date? If it's the latter, these solutions are getting out of control. If it's the former, you need to edit your question to make this clear, because it is not. – Pittsburgh DBA Feb 17 '18 at 13:11
  • @RomanPekar - I can't imagine it taking up that much space...how many products are there? What is the average time span? – Aducci Feb 18 '18 at 21:19
  • @RomanPekar search for "space-time tradeoff". You cannot minimize both at the same time. – Salman A Feb 19 '18 at 07:08
  • @Aducci If we want to make it easy-to-use for end user, then it should work with, basically, any date passed (periods could be open-ended). Well, let's say there're 6000 products. If we use time from, let's say, 1990-01-01 till 2100-01-01 it will be 110 * 365 * 6000, which is 219M rows already.And it's still doesn't work with any date (for example 9999-12-31). So it's a fine working solution, but good 'range' solution is still better. – Roman Pekar Feb 19 '18 at 11:56
  • @RomanPekar you could wrap the view with an inline table function in order to reduce the number of rows you need to store...using a where clause similar to `where [date] = case when @date > getdate() then getdate() else @date end` – Aducci Feb 19 '18 at 16:22
0

A solution without gaps might be this:

DECLARE @tbl TABLE(ID INT IDENTITY,[start_date] DATE);
INSERT INTO @tbl VALUES({d'2016-10-01'}),({d'2016-09-01'}),({d'2016-08-01'}),({d'2016-07-01'}),({d'2016-06-01'});

SELECT * FROM @tbl;

DECLARE @DateFilter DATE={d'2016-08-13'};

SELECT TOP 1 * 
FROM @tbl
WHERE [start_date]<=@DateFilter
ORDER BY [start_date] DESC

Important: Be sure that there is an (unique) index on start_date

UPDATE: for different products

DECLARE @tbl TABLE(ID INT IDENTITY,ProductID INT,[start_date] DATE);
INSERT INTO @tbl VALUES
--product 1
(1,{d'2016-10-01'}),(1,{d'2016-09-01'}),(1,{d'2016-08-01'}),(1,{d'2016-07-01'}),(1,{d'2016-06-01'})
--product 1
,(2,{d'2016-10-17'}),(2,{d'2016-09-16'}),(2,{d'2016-08-15'}),(2,{d'2016-07-10'}),(2,{d'2016-06-11'});

DECLARE @DateFilter DATE={d'2016-08-13'};

WITH PartitionedCount AS
(
    SELECT ROW_NUMBER() OVER(PARTITION BY ProductID ORDER BY [start_date] DESC) AS Nr
          ,*
    FROM @tbl
    WHERE [start_date]<=@DateFilter
)
SELECT *
FROM PartitionedCount
WHERE Nr=1
Community
  • 1
  • 1
Shnugo
  • 66,100
  • 9
  • 53
  • 114
  • Yes, this is nice way of solving the task. However, the performance is still not optimal. What I'd like to have is certain approach which gives me basically the same performance as if I stored those products list on certain date and selected it with `date = @date`. – Roman Pekar Nov 10 '16 at 17:06
  • @RomanPekar There must be a difference whether you grab into an index based on equality or if you have to locate a value within intervalls. Nevertheless: If there is an index (check the EP if it is used!) the `WHERE [start_date]<=@DateFilter` should be lightning fast and the `TOP 1 ORDER BY [start_date] DESC` should pick the relevant row instantly... It makes me wonder, that you do not observe better performance... – Shnugo Nov 10 '16 at 17:09
  • well, for one row it works fine, but what if I want to get the list of valid products on certain date? I can of course explode my table into dates and use equality check, but I really would like to know if there's more beautiful solution – Roman Pekar Nov 10 '16 at 17:12
  • @RomanPekar If you have the same start-dates (first of month) for all your products, simply use `SELECT TOP 1 WITH TIES`... Does this help? – Shnugo Nov 10 '16 at 17:16
  • @RomanPekar I just added an approach using `ROW_NUMBER` which would get all products in one go. Just join this resultset to your main query... – Shnugo Nov 10 '16 at 17:20
  • @RomanPekar Is this question solved? Do you need further help? Please allow me one hint: If this question is solved, it would be very kind of you, to tick the acceptance check below the (best) answer's vote counter. This will 1) mark this issue as solved 2) make it easier for followers to find the best solution 3) pay points to the answerer and 4) pay points to you. Since you've crossed the 15 points border yourself, you are - additionally - asked to vote on contributions. This is the SO-way to say thank you. Happy Coding! – Shnugo Nov 17 '16 at 07:51
0

First you need to create a unique clustered index for (begin_date, end_date, code)

Then SQL engine will be able to do INDEX SEEK.

Additionally, you can also try to create a view for dbo.Products table to join that table with pre-populated dbo.Dates table.

select p.code, p.val, p.begin_date, p.end_date, d.[date]
    from dbo.Product as p
        inner join dbo.dates d on p.begin_date <= d.[date] and d.[date] <= p.end_date

Then in your function, you use that view as "where @date = view.date". The result can be either better or slightly worse... it depends on the actual data.

You also can try to make that view indexed (depends on how often it is being updated).

Alternatively, you can have better performance if you populate dbo.Products table for every date in the [begin_date] .. [end_date] range.

Anton
  • 2,846
  • 1
  • 10
  • 15
0

Approach with ROW_NUMBER scans the whole Products table once. It is the best method if you have a lot of product codes in the Products table and few validity ranges for each code.

WITH
CTE_rn
AS
(
    SELECT
        code
        ,value
        ,ROW_NUMBER() OVER (PARTITION BY code ORDER BY begin_date DESC) AS rn
    FROM Products
    WHERE begin_date <= @date
)
SELECT *
FROM
    <some table with product codes> as t
    LEFT JOIN CTE_rn ON CTE_rn.code = t.product_code AND CTE_rn.rn = 1
;

If you have few product codes and a lot of validity ranges for each code in the Products table, then it is better to seek the Products table for each code using OUTER APPLY.

SELECT *
FROM
    <some table with product codes> as t
    OUTER APPLY
    (
        SELECT TOP(1)
            Products.value
        FROM Products
        WHERE
            Products.code = t.product_code
            AND Products.begin_date <= @date
        ORDER BY Products.begin_date DESC
    ) AS A
;

Both variants need unique index on (code, begin_date DESC) include (value).

Note how the queries don't even look at end_date, because they assume that intervals don't have gaps. They will work in SQL Server 2008.

Vladimir Baranov
  • 31,799
  • 5
  • 53
  • 90