Refactoring a tsql view which uses row_number() to return rows with a unique column value

Question

I have a sql view, which I'm using to retrieve data. Lets say its a large list of products, which are linked to the customers who have bought them. The view should return only one row per product, no matter how many customers it is linked to. I'm using the row_number function to achieve this. (This example is simplified, the generic situation would be a query where there should only be one row returned for each unique value of some column X. Which row is returned is not important)

CREATE VIEW productView AS
SELECT * FROM 
    (SELECT 
        Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
        customer.Id
        //various other columns
    FROM products
    LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
    //various other joins
    ) as temp
WHERE temp.prodcut_numbering = 1

Now lets say that the total number of rows in this view is ~1 million, and running select * from productView takes 10 seconds. Performing a query such as select * from productView where productID = 10 takes the same amount of time. I believe this is because the query gets evaluated to this

SELECT * FROM 
    (SELECT 
        Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
        customer.Id
        //various other columns
    FROM products
    LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
    //various other joins
    ) as temp
WHERE prodcut_numbering = 1 and prodcut.Id = 10

I think this is causing the inner subquery to be evaluated in full each time. Ideally I'd like to use something along the following lines

SELECT 
    Row_number() OVER(PARTITION BY products.productID ORDER BY products.productID) AS product_numbering,
    customer.id
    //various other columns
FROM products
    LEFT OUTER JOIN customer ON customer.productId = prodcut.Id
    //various other joins
WHERE prodcut_numbering = 1

But this doesn't seem to be allowed. Is there any way to do something similar?

EDIT -

After much experimentation, the actual problem I believe I am having is how to force a join to return exactly 1 row. I tried to use outer apply, as suggested below. Some sample code.

CREATE TABLE Products (id int not null PRIMARY KEY)
CREATE TABLE Customers (
        id int not null PRIMARY KEY,
        productId int not null,
        value varchar(20) NOT NULL)

declare @count int = 1
while @count <= 150000
begin
        insert into Customers (id, productID, value)
        values (@count,@count/2, 'Value ' + cast(@count/2 as varchar))      
        insert into Products (id) 
        values (@count)
        SET @count = @count + 1
end

CREATE NONCLUSTERED INDEX productId ON Customers (productID ASC)

With the above sample set, the 'get everything' query below

select * from Products
outer apply (select top 1 * 
            from Customers
            where Products.id = Customers.productID) Customers

takes ~1000ms to run. Adding an explicit condition:

select * from Products
outer apply (select top 1 * 
            from Customers
            where Products.id = Customers.productID) Customers
where Customers.value = 'Value 45872'

Takes some identical amount of time. This 1000ms for a fairly simple query is already too much, and scales the wrong way (upwards) when adding additional similar joins.

Do you need actual customer details or just existence of or just a customerid? The subquery is evaluatde because the "10" isn't known in advance. And you are asking for the 10th row exactly. Hence my first question about desired output — gbn, Oct 18 '11 at 11:37
Really good observation - SQL isn't able to apply the view filter into the subquery. Do you really need the flexibility of the view? If you used a SPROC or a table valued function with 'defined' filters (ProductID in your example), you could build in the filter into the subquery. And in the case where your PARTITION BY and the FILTER are the same (ProductId), you wouldn't need the PARTITION at all - so SELECT TOP 1 should be sufficient. — StuartLC, Oct 18 '11 at 12:09
I do need the actual customer details (or null values if none exists), not just the existance of one. I also have to use a view, refactoring the app which retrieves the data isn't possible. — John, Oct 18 '11 at 12:56
@John, this doesn't make sense to me; in the post you say you are getting 1 row per product regardless of how many customers is it tied to. So, it seems to me you wouldn't even need to join to the customers table. Am I missing something? — Parmenion, Oct 27 '11 at 15:30
The example is simplified, imagine that I need the details from a customer, it doesn't matter which one i get as long as I get some details. It sounds very odd, it's the way of the system I'm working with (not my design!). — John, Oct 28 '11 at 09:46

score 3 · Answer 1 · answered Oct 24 '11 at 18:06

3

Try the following approach, using a Common Table Expression (CTE). With the test data you provided, it returns specific ProductIds in less than a second.

create view ProductTest as 

with cte as (
select 
    row_number() over (partition by p.id order by p.id) as RN, 
    c.*
from 
    Products p
    inner join Customers c
        on  p.id = c.productid
)

select * 
from cte
where RN = 1
go

select * from ProductTest where ProductId = 25

answered Oct 24 '11 at 18:06

Derek

21,828
7
53
61

This does seem to work much faster than other methods, but it still causes the entire subquery to be evaluated. Performing `select * from ProductTest` alone takes roughly the same time, and has the same execution plan, as it does with the where clause. – John Oct 28 '11 at 09:48
I think this is the best you're going to be able to get due to the nature of views themselves. The other option is to create a stored procedure, or maybe a table-valued function, that passes in the productid and may be able to do the filtering directly on the portion of the query you desire. – Derek Oct 28 '11 at 12:06

score 2 · Answer 2 · answered Oct 18 '11 at 21:27

2

What if you did something like:

SELECT ...
FROM products
OUTER APPLY (SELECT TOP 1 * from customer where customerid = products.buyerid) as customer
...

Then the filter on productId should help. It might be worse without filtering, though.

answered Oct 18 '11 at 21:27

GilM

3,711
17
18

score 1 · Answer 3 · answered Oct 29 '11 at 19:36

The problem is that your data model is flawed. You should have three tables:

Customers (customerId, ...)
Products (productId,...)
ProductSales (customerId, productId)

Furthermore, the sale table should probably be split into 1-to-many (Sales and SalesDetails). Unless you fix your data model you're just going to run circles around your tail chasing red-herring problems. If the system is not your design, fix it. If the boss doesn't let your fix it, then fix it. If you cannot fix it, then fix it. There isn't a easy way out for the bad data model you're proposing.

score 0 · Answer 4 · answered Jun 04 '14 at 17:09

this will probably be fast enough if you really don't care which customer you bring back

select p1.*, c1.*
FROM products p1
Left Join (
        select p2.id, max( c2.id) max_customer_id
        From product p2
        Join customer c2 on
        c2.productID = p2.id
        group by 1
) product_max_customer
Left join customer c1 on
c1.id = product_max_customer.max_customer_id
;

Refactoring a tsql view which uses row_number() to return rows with a unique column value

4 Answers4