51

I have the following two data structures.

First, a list of properties applied to object triples:

Object1  Object2  Object3 Property  Value
     O1       O2       O3       P1  "abc"
     O1       O2       O3       P2  "xyz"
     O1       O3       O4       P1  "123"
     O2       O4       O5       P1  "098"

Second, an inheritance tree:

O1
    O2
        O4
    O3
        O5

Or viewed as a relation:

Object    Parent
    O2        O1
    O4        O2
    O3        O1
    O5        O3
    O1      null

The semantics of this being that O2 inherits properties from O1; O4 - from O2 and O1; O3 - from O1; and O5 - from O3 and O1, in that order of precedence.
NOTE 1: I have an efficient way to select all children or all parents of a given object. This is currently implemented with left and right indexes, but hierarchyid could also work. This does not seem important right now.
NOTE 2: I have tiggers in place that make sure that the "Object" column always contains all possible objects, even when they do not really have to be there (i.e. have no parent or children defined). This makes it possible to use inner joins rather than severely less effiecient outer joins.

The objective is: Given a pair of (Property, Value), return all object triples that have that property with that value either defined explicitly or inherited from a parent.

NOTE 1: An object triple (X,Y,Z) is considered a "parent" of triple (A,B,C) when it is true that either X = A or X is a parent of A, and the same is true for (Y,B) and (Z,C).
NOTE 2: A property defined on a closer parent "overrides" the same property defined on a more distant parent.
NOTE 3: When (A,B,C) has two parents - (X1,Y1,Z1) and (X2,Y2,Z2), then (X1,Y1,Z1) is considered a "closer" parent when:
(a) X2 is a parent of X1, or
(b) X2 = X1 and Y2 is a parent of Y1, or
(c) X2 = X1 and Y2 = Y1 and Z2 is a parent of Z1

In other words, the "closeness" in ancestry for triples is defined based on the first components of the triples first, then on the second components, then on the third components. This rule establishes an unambigous partial order for triples in terms of ancestry.

For example, given the pair of (P1, "abc"), the result set of triples will be:

 O1, O2, O3     -- Defined explicitly
 O1, O2, O5     -- Because O5 inherits from O3
 O1, O4, O3     -- Because O4 inherits from O2
 O1, O4, O5     -- Because O4 inherits from O2 and O5 inherits from O3
 O2, O2, O3     -- Because O2 inherits from O1
 O2, O2, O5     -- Because O2 inherits from O1 and O5 inherits from O3
 O2, O4, O3     -- Because O2 inherits from O1 and O4 inherits from O2
 O3, O2, O3     -- Because O3 inherits from O1
 O3, O2, O5     -- Because O3 inherits from O1 and O5 inherits from O3
 O3, O4, O3     -- Because O3 inherits from O1 and O4 inherits from O2
 O3, O4, O5     -- Because O3 inherits from O1 and O4 inherits from O2 and O5 inherits from O3
 O4, O2, O3     -- Because O4 inherits from O1
 O4, O2, O5     -- Because O4 inherits from O1 and O5 inherits from O3
 O4, O4, O3     -- Because O4 inherits from O1 and O4 inherits from O2
 O5, O2, O3     -- Because O5 inherits from O1
 O5, O2, O5     -- Because O5 inherits from O1 and O5 inherits from O3
 O5, O4, O3     -- Because O5 inherits from O1 and O4 inherits from O2
 O5, O4, O5     -- Because O5 inherits from O1 and O4 inherits from O2 and O5 inherits from O3

Note that the triple (O2, O4, O5) is absent from this list. This is because property P1 is defined explicitly for the triple (O2, O4, O5) and this prevents that triple from inheriting that property from (O1, O2, O3). Also note that the triple (O4, O4, O5) is also absent. This is because that triple inherits its value of P1="098" from (O2, O4, O5), because it is a closer parent than (O1, O2, O3).

The straightforward way to do it is the following. First, for every triple that a property is defined on, select all possible child triples:

select Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value
from TriplesAndProperties tp

-- Select corresponding objects of the triple
inner join Objects as Objects1 on Objects1.Id = tp.O1
inner join Objects as Objects2 on Objects2.Id = tp.O2
inner join Objects as Objects3 on Objects3.Id = tp.O3

-- Then add all possible children of all those objects
inner join Objects as Children1 on Objects1.Id [isparentof] Children1.Id
inner join Objects as Children2 on Objects2.Id [isparentof] Children2.Id
inner join Objects as Children3 on Objects3.Id [isparentof] Children3.Id

But this is not the whole story: if some triple inherits the same property from several parents, this query will yield conflicting results. Therefore, second step is to select just one of those conflicting results:

select * from
(
    select 
        Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,
        row_number() over( 
            partition by Children1.Id, Children2.Id, Children3.Id, tp.Property
            order by Objects1.[depthInTheTree] descending, Objects2.[depthInTheTree] descending, Objects3.[depthInTheTree] descending
        )
        as InheritancePriority
    from
    ... (see above)
)
where InheritancePriority = 1

The window function row_number() over( ... ) does the following: for every unique combination of objects triple and property, it sorts all values by the ancestral distance from the triple to the parents that the value is inherited from, and then I only select the very first of the resulting list of values. A similar effect can be achieved with a GROUP BY and ORDER BY statements, but I just find the window function semantically cleaner (the execution plans they yield are identical). The point is, I need to select the closest of contributing ancestors, and for that I need to group and then sort within the group.

And finally, now I can simply filter the result set by Property and Value.

This scheme works. Very reliably and predictably. It has proven to be very powerful for the business task it implements.

The only trouble is, it is awfuly slow.
One might point out the join of seven tables might be slowing things down, but that is actually not the bottleneck.

According to the actual execution plan I'm getting from the SQL Management Studio (as well as SQL Profiler), the bottleneck is the sorting. The problem is, in order to satisfy my window function, the server has to sort by Children1.Id, Children2.Id, Children3.Id, tp.Property, Parents1.[depthInTheTree] descending, Parents2.[depthInTheTree] descending, Parents3.[depthInTheTree] descending, and there can be no indexes it can use, because the values come from a cross join of several tables.

EDIT: Per Michael Buen's suggestion (thank you, Michael), I have posted the whole puzzle to sqlfiddle here. One can see in the execution plan that the Sort operation accounts for 32% of the whole query, and that is going to grow with the number of total rows, because all the other operations use indexes.

Usually in such cases I would use an indexed view, but not in this case, because indexed views cannot contain self-joins, of which there are six.

The only way that I can think of so far is to create six copies of the Objects table and then use them for the joins, thus enabling an indexed view.
Did the time come that I shall be reduced to that kind of hacks? The despair sets in.

Fyodor Soikin
  • 78,590
  • 9
  • 125
  • 172
  • 3
    I think your relations table is missing an entry for `Object=03`, `Parent=01`. – stakx - no longer contributing Aug 20 '12 at 08:15
  • @stakx: You're right, it is. Fixed that. – Fyodor Soikin Aug 20 '12 at 12:27
  • I also encountered a bottleneck mixing `row_number`ing and recursive cte. Try to materialize the results of windowing routine(e.g. row_number) to an actual table(temporary) before you do a recursive query on them. Example here: http://www.ienablemuch.com/2012/05/recursive-cte-is-evil-and-cursor-is.html – Michael Buen Aug 20 '12 at 13:07
  • @MichaelBuen: I don't do any recursive cycles. – Fyodor Soikin Aug 20 '12 at 13:38
  • @FyodorSoikin Ah.. I thought you are doing one(based on the tree illustration on your question) ツ – Michael Buen Aug 20 '12 at 13:41
  • 1
    Try to post your query(and some data) in http://sqlfiddle.com, so other stackoverfellows shall have something to base or benchmark upon with their formulated queries – Michael Buen Aug 20 '12 at 13:44
  • By any chance, do you devise this: `where InheritancePriority = 1` to weed out the unintended cartesian'd rows? – Michael Buen Aug 20 '12 at 13:47
  • @MichaelBuen: Thank you for the suggestion, I have posted the whole thing to sqlfiddle: http://sqlfiddle.com/#!3/e1465/10/0 – Fyodor Soikin Aug 20 '12 at 21:13
  • @MichaelBuen: Yes, that condition is for weeding out all rows except the one coming from the nearest parent. – Fyodor Soikin Aug 20 '12 at 21:14
  • I mean if there is unintended cartesian'd rows in you results, hence the query need `InheritancePriority=1`? – Michael Buen Aug 20 '12 at 23:34
  • @MichaelBuen: I don't understand this question. – Fyodor Soikin Aug 21 '12 at 00:31
  • Nvm, I'll just have a look on your sqlfiddle – Michael Buen Aug 21 '12 at 00:34
  • @FyodorSoikin All the IDX is 1 even I removed `IDX = 1`, is your query correct? http://sqlfiddle.com/#!3/e1465/14 – Michael Buen Aug 21 '12 at 12:30
  • @MichaelBuen: Yes, the query is correct, but the initial data isn't. Turns out, I screwed up in specifying the objects inheritance. Fixed now, here: http://sqlfiddle.com/#!3/7c7a0/3/0 – Fyodor Soikin Aug 23 '12 at 03:51
  • @FyodorSoikin in your actual data, what is the deepest hierarchy level that you have reached, or what is the possible/logical limit of the levels that can be reached? – Jaguar Aug 23 '12 at 10:58
  • @Jaguar: So far, in the actual production system, the height of the tree is 4 levels. There is, however, no principal limit on the depth, and the user can affect the depth in some cases. Having said that, I don't see how the tree depth affects the problem. – Fyodor Soikin Aug 23 '12 at 13:51
  • @FyodorSoikin it probably doesn't i was just asking for completeness. – Jaguar Aug 23 '12 at 15:31
  • "[...]because indexed views cannot contain self-joins, of which there are six." I believe you will also not be able to make an indexed view because of `row_number()`. – Tim Lehner Aug 24 '12 at 17:24
  • @TimLehner: yes, you are right: indexed views cannot contain window functions either. But this is not necessary. If I can put everything before the window function into an indexed view, then I can create an index that will eliminate the need for sorting, thus making the window function work instantly. – Fyodor Soikin Aug 24 '12 at 19:08
  • 1
    Just one quick question. Is this a real problem or only theoretical? I am maybe to primitive to understand where something like this is needed in real world. Especially as a query in a DB. I would probably try to make the problem easier by going one step back, that is why I ask that way. Then I would also look at data volumes and ask myself if I could load this in memory and make it fast this way. By the way how slow is slow and why does it need to be faster? – hol Aug 25 '12 at 10:36
  • @hol: Yes, it is a very real problem. This system does work in production and implements a business task as we write these comments. The business task is membership-based access control with inherited permissions. Much like you see on, say, Windows file system, only applied to tuples of objects rather than single objects. The description I give here is, of course, simplified compared to the real system, but it captures the essence of the problem. – Fyodor Soikin Aug 25 '12 at 15:48
  • @hol: as far as volumes and slowness. On several hundred of objects that we currently have in the database, the query runs on the order of seconds, which is totally unacceptable. In the future, we anticipate tens of thousands of objects, and because the index-less sorting (which is the bottleneck) takes [n log n], where n is the number of objects to the power of three (for triples), the time is going to grow exponentially. Therefore, I absolutely must come up with a more efficient solution. – Fyodor Soikin Aug 25 '12 at 15:54
  • Thanks for taking your time explaining. That makes the whole problem realer. I am no SQL server expert so I will not answer the question but it sounds to me that your best solution will be to materialize the result set and put an index on it. That moves the time needed from retrieving data to the writing data. – hol Aug 25 '12 at 20:29
  • @hol: yes, that is exactly the solution I have in mind as well: to materialize the set. The problem is, how to do it? I don't want to update denormalized records by hand, because, keeping in mind that the data is non-trivial, that is going to be a major source of bugs. On the other hand, I cannot make the server do it for me because of the limitations it puts on indexed views. – Fyodor Soikin Aug 26 '12 at 05:03

6 Answers6

2

I have 3 possible answers.

The sql fiddle for your question is here: http://sqlfiddle.com/#!3/7c7a0/3/0

The sql fiddle for my answer is here: http://sqlfiddle.com/#!3/5d257/1

Warnings:

  1. Query Analyzer is not enough - I notice a number of answers were rejected because their query plans are more expensive than the original query. The Analyzer is only guide. Depending on the actual data set, hardware, and use case, more expensive queries can return results faster than less expensive ones. You have to test in your environment.
  2. Query Analyzer is ineffective - even if you find a way to remove the "most expensive step" from a query, it often makes no difference to your query.
  3. Query changes alone rarely mitigate schema/design issues - Some answers were rejected because they involve schema level changes such as triggers and additional tables. Complex queries that resist optimization are a strong sign that problem is with the underlying design or with my expectations. You may not like it, but you might have to accept that the problem is not solvable at the query level.
  4. Indexed view cannot contain row_number()/partitition clause - Working around the self-join problem by creating six copies of objects table is not enough to allow you to create the indexed view you've suggested. I tried it in this sqlfiddle. If you uncomment that last "create index" statement, you'll get an error because your view "contains a ranking or aggregate window function."

Working Answers:

  1. Left Join instead of row_number() - You can use a query that uses left joins to exclude results that have been overriden lower in the tree. Removing the finally "order by" from this query actually removes the sort that has been plaguing you! The execution plan for this query is still more expensive than your original, but see Disclaimer #1 above.
  2. Indexed view for part of the query - Using some serious query magic (based on this technique), I created an indexed view for part of the query. This view can be used to enhance the original question query or answer #1.
  3. Actualize into a well indexed table - Someone else suggested this answer, but they might not have explained it well. Unless your result set is very large or you are doing very frequent updates to the source tables, actualizing the results of a query and using a trigger to keep them up-to-date is a perfectly fine way to work around this kind of issue. Once you create a view for your query, it is easy enough to test this option. You can reuse answer #2 to speed up the trigger, and then further improve it over time. (You are talking about creating six copies of your tables, try this first. It guarantees that the performance for the select you care about will be as good as possible.)

Here's is the schema part of my answer from sqlfiddle:

Create Table Objects
(
    Id int not null identity primary key,
    LeftIndex int not null default 0,
    RightIndex int not null default 0
)

alter table Objects add ParentId int null references Objects

CREATE TABLE TP
(
    Object1 int not null references Objects,
    Object2 int not null references Objects,
    Object3 int not null references Objects,
    Property varchar(20) not null,
    Value varchar(50) not null
)


insert into Objects(LeftIndex, RightIndex) values(1, 10)
insert into Objects(ParentId, LeftIndex, RightIndex) values(1, 2, 5)
insert into Objects(ParentId, LeftIndex, RightIndex) values(1, 6, 9)
insert into Objects(ParentId, LeftIndex, RightIndex) values(2, 3, 4)
insert into Objects(ParentId, LeftIndex, RightIndex) values(3, 7, 8)

insert into TP(Object1, Object2, Object3, Property, Value) values(1,2,3, 'P1', 'abc')
insert into TP(Object1, Object2, Object3, Property, Value) values(1,2,3, 'P2', 'xyz')
insert into TP(Object1, Object2, Object3, Property, Value) values(1,3,4, 'P1', '123')
insert into TP(Object1, Object2, Object3, Property, Value) values(2,4,5, 'P1', '098')

create index ix_LeftIndex on Objects(LeftIndex)
create index ix_RightIndex on Objects(RightIndex)
create index ix_Objects on TP(Property, Value, Object1, Object2, Object3)
create index ix_Prop on TP(Property)
GO

---------- QUESTION ADDITIONAL SCHEMA --------
CREATE VIEW TPResultView AS
Select O1, O2, O3, Property, Value
FROM
(
    select Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,

    row_number() over( 
        partition by Children1.Id, Children2.Id, Children3.Id, tp.Property
        order by Objects1.LeftIndex desc, Objects2.LeftIndex desc, Objects3.LeftIndex desc
    )
    as Idx

    from tp

    -- Select corresponding objects of the triple
    inner join Objects as Objects1 on Objects1.Id = tp.Object1
    inner join Objects as Objects2 on Objects2.Id = tp.Object2
    inner join Objects as Objects3 on Objects3.Id = tp.Object3

    -- Then add all possible children of all those objects
    inner join Objects as Children1 on Children1.LeftIndex between Objects1.LeftIndex and Objects1.RightIndex
    inner join Objects as Children2 on Children2.LeftIndex between Objects2.LeftIndex and Objects2.RightIndex
    inner join Objects as Children3 on Children3.LeftIndex between Objects3.LeftIndex and Objects3.RightIndex
) as x
WHERE idx = 1 
GO

---------- ANSWER 1 SCHEMA --------

CREATE VIEW TPIntermediate AS
select tp.Property, tp.Value 
    , Children1.Id as O1, Children2.Id as O2, Children3.Id as O3
    , Objects1.LeftIndex as PL1, Objects2.LeftIndex as PL2, Objects3.LeftIndex as PL3    
    , Children1.LeftIndex as CL1, Children2.LeftIndex as CL2, Children3.LeftIndex as CL3    
    from tp

    -- Select corresponding objects of the triple
    inner join Objects as Objects1 on Objects1.Id = tp.Object1
    inner join Objects as Objects2 on Objects2.Id = tp.Object2
    inner join Objects as Objects3 on Objects3.Id = tp.Object3

    -- Then add all possible children of all those objects
    inner join Objects as Children1 WITH (INDEX(ix_LeftIndex)) on Children1.LeftIndex between Objects1.LeftIndex and Objects1.RightIndex
    inner join Objects as Children2 WITH (INDEX(ix_LeftIndex)) on Children2.LeftIndex between Objects2.LeftIndex and Objects2.RightIndex
    inner join Objects as Children3 WITH (INDEX(ix_LeftIndex)) on Children3.LeftIndex between Objects3.LeftIndex and Objects3.RightIndex
GO

---------- ANSWER 2 SCHEMA --------

-- Partial calculation using an indexed view
-- Circumvented the self-join limitation using a black magic technique, based on 
-- http://jmkehayias.blogspot.com/2008/12/creating-indexed-view-with-self-join.html
CREATE TABLE dbo.multiplier (i INT PRIMARY KEY)

INSERT INTO dbo.multiplier VALUES (1) 
INSERT INTO dbo.multiplier VALUES (2) 
INSERT INTO dbo.multiplier VALUES (3) 
GO

CREATE VIEW TPIndexed
WITH SCHEMABINDING
AS

SELECT tp.Object1, tp.object2, tp.object3, tp.property, tp.value,
    SUM(ISNULL(CASE M.i WHEN 1 THEN Objects.LeftIndex ELSE NULL END, 0)) as PL1,
    SUM(ISNULL(CASE M.i WHEN 2 THEN Objects.LeftIndex ELSE NULL END, 0)) as PL2,
    SUM(ISNULL(CASE M.i WHEN 3 THEN Objects.LeftIndex ELSE NULL END, 0)) as PL3,
    SUM(ISNULL(CASE M.i WHEN 1 THEN Objects.RightIndex ELSE NULL END, 0)) as PR1,
    SUM(ISNULL(CASE M.i WHEN 2 THEN Objects.RightIndex ELSE NULL END, 0)) as PR2,
    SUM(ISNULL(CASE M.i WHEN 3 THEN Objects.RightIndex ELSE NULL END, 0)) as PR3,
    COUNT_BIG(*) as ID
    FROM dbo.tp
    cross join dbo.multiplier M 
    inner join dbo.Objects 
    on (M.i = 1 AND Objects.Id = tp.Object1)
    or (M.i = 2 AND Objects.Id = tp.Object2)
    or (M.i = 3 AND Objects.Id = tp.Object3)
GROUP BY tp.Object1, tp.object2, tp.object3, tp.property, tp.value
GO

-- This index is mostly useless but required
create UNIQUE CLUSTERED index pk_TPIndexed on dbo.TPIndexed(property, value, object1, object2, object3)
-- Once we have the clustered index, we can create a nonclustered that actually addresses our needs
create NONCLUSTERED index ix_TPIndexed on dbo.TPIndexed(property, value, PL1, PL2, PL3, PR1, PR2, PR3)
GO

-- NOTE: this View is not indexed, but is uses the indexed view 
CREATE VIEW TPIndexedResultView AS
Select O1, O2, O3, Property, Value
FROM
(
    select Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,

    row_number() over( 
        partition by tp.Property, Children1.Id, Children2.Id, Children3.Id
        order by tp.Property, Tp.PL1 desc, Tp.PL2 desc, Tp.PL3 desc
    )
    as Idx

    from TPIndexed as TP WITH (NOEXPAND)

    -- Then add all possible children of all those objects
    inner join Objects as Children1 WITH (INDEX(ix_LeftIndex)) on Children1.LeftIndex between TP.PL1 and TP.PR1
    inner join Objects as Children2 WITH (INDEX(ix_LeftIndex)) on Children2.LeftIndex between TP.PL2 and TP.PR2
    inner join Objects as Children3 WITH (INDEX(ix_LeftIndex)) on Children3.LeftIndex between TP.PL3 and TP.PR3
) as x
WHERE idx = 1 
GO


-- NOTE: this View is not indexed, but is uses the indexed view 
CREATE VIEW TPIndexedIntermediate AS
select tp.Property, tp.Value 
    , Children1.Id as O1, Children2.Id as O2, Children3.Id as O3
    , PL1, PL2, PL3    
    , Children1.LeftIndex as CL1, Children2.LeftIndex as CL2, Children3.LeftIndex as CL3    
    from TPIndexed as TP WITH (NOEXPAND)

    -- Then add all possible children of all those objects
    inner join Objects as Children1 WITH (INDEX(ix_LeftIndex)) on Children1.LeftIndex between TP.PL1 and TP.PR1
    inner join Objects as Children2 WITH (INDEX(ix_LeftIndex)) on Children2.LeftIndex between TP.PL2 and TP.PR2
    inner join Objects as Children3 WITH (INDEX(ix_LeftIndex)) on Children3.LeftIndex between TP.PL3 and TP.PR3  
GO


---------- ANSWER 3 SCHEMA --------
-- You're talking about making six copies of the TP table
-- If you're going to go that far, you might as well, go the trigger route
-- The performance profile is much the same - slower on insert, faster on read
-- And instead of still recalculating on every read, you'll be recalculating
-- only when the data changes. 

CREATE TABLE TPResult
(
    Object1 int not null references Objects,
    Object2 int not null references Objects,
    Object3 int not null references Objects,
    Property varchar(20) not null,
    Value varchar(50) not null
)
GO

create UNIQUE index ix_Result on TPResult(Property, Value, Object1, Object2, Object3)


--You'll have to imagine this trigger, sql fiddle doesn't want to do it
--CREATE TRIGGER tr_TP
--ON TP
--  FOR INSERT, UPDATE, DELETE
--AS
--  DELETE FROM TPResult
-- -- For this example we'll just insert into the table once
INSERT INTO TPResult 
SELECT O1, O2, O3, Property, Value 
FROM TPResultView

Query part of my answer from sqlfiddle:

-------- QUESTION QUERY ----------
-- Original query, modified to use the view I added
SELECT O1, O2, O3, Property, Value 
FROM TPResultView
WHERE property = 'P1' AND value = 'abc'
-- Your assertion is that this order by is the most expensive part. 
-- Sometimes converting queries into views allows the server to
-- Optimize them better over time.
-- NOTE: removing this order by has no effect on this query.
-- ORDER BY O1, O2, O3
GO

-------- ANSWER 1  QUERY ----------
-- A different way to get the same result. 
-- Query optimizer says this is more expensive, but I've seen cases where
-- it says a query is more expensive but it returns results faster.
SELECT O1, O2, O3, Property, Value
FROM (
  SELECT A.O1, A.O2, A.O3, A.Property, A.Value
  FROM TPIntermediate A
  LEFT JOIN TPIntermediate B ON A.O1 = B.O1
    AND A.O2 = B.O2
    AND A.O3 = B.O3
    AND A.Property = B.Property
    AND 
    (
      -- Find any rows with Parent LeftIndex triplet that is greater than this one
      (A.PL1 < B.PL1
      AND A.PL2 < B.PL2
      AND A.PL3 < B.PL3) 
    OR
      -- Find any rows with LeftIndex triplet that is greater than this one
      (A.CL1 < B.CL1
      AND A.CL2 < B.CL2
      AND A.CL3 < B.CL3)
    )
  -- If this row has any rows that match the previous two cases, exclude it
  WHERE B.O1 IS NULL ) AS x
WHERE property = 'P1' AND value = 'abc'
-- NOTE: Removing this order _DOES_ reduce query cost removing the "sort" action
-- that has been the focus of your question.   
-- Howeer, it wasn't clear from your question whether this order by was required.
--ORDER BY O1, O2, O3
GO

-------- ANSWER 2  QUERIES ----------
-- Same as above but using an indexed view to partially calculate results

SELECT O1, O2, O3, Property, Value 
FROM TPIndexedResultView
WHERE property = 'P1' AND value = 'abc'
-- Your assertion is that this order by is the most expensive part. 
-- Sometimes converting queries into views allows the server to
-- Optimize them better over time.
-- NOTE: removing this order by has no effect on this query.
--ORDER BY O1, O2, O3
GO

SELECT O1, O2, O3, Property, Value
FROM (
  SELECT A.O1, A.O2, A.O3, A.Property, A.Value
  FROM TPIndexedIntermediate A
  LEFT JOIN TPIndexedIntermediate B ON A.O1 = B.O1
    AND A.O2 = B.O2
    AND A.O3 = B.O3
    AND A.Property = B.Property
    AND 
    (
      -- Find any rows with Parent LeftIndex triplet that is greater than this one
      (A.PL1 < B.PL1
      AND A.PL2 < B.PL2
      AND A.PL3 < B.PL3) 
    OR
      -- Find any rows with LeftIndex triplet that is greater than this one
      (A.CL1 < B.CL1
      AND A.CL2 < B.CL2
      AND A.CL3 < B.CL3)
    )
  -- If this row has any rows that match the previous two cases, exclude it
  WHERE B.O1 IS NULL ) AS x
WHERE property = 'P1' AND value = 'abc'
-- NOTE: Removing this order _DOES_ reduce query cost removing the "sort" action
-- that has been the focus of your question.   
-- Howeer, it wasn't clear from your question whether this order by was required.
--ORDER BY O1, O2, O3
GO



-------- ANSWER 3  QUERY ----------
-- Returning results from a pre-calculated table is fast and easy
-- Unless your are doing many more inserts than reads, or your result
-- set is very large, this is a fine way to compensate for a poor design
-- in one area of your database.
SELECT Object1 as O1, Object2 as O2, Object3 as O3, Property, Value 
FROM TPResult
WHERE property = 'P1' AND value = 'abc'
ORDER BY O1, O2, O3
BitwiseMan
  • 1,887
  • 13
  • 24
  • Unfortunately, half the bounty goes to you just for the upvotes, but your answers don't work either. The "left join instead of window function" option looked like a real winner at first, I was already uncorking a bottle of wine... But turn on real data it turned out just as bad: the hash match implementing the join is O(2N), because there cannot be any indexes on probe columns, which is better than Nlog(N), but still prohibitive. – Fyodor Soikin Sep 04 '12 at 21:53
  • Local optimizations, such as indexing some parts if the query - of course I've done that, and in more places than you suggested. Sadly, these only reduce the constant and do nothing about the complexity itself. – Fyodor Soikin Sep 04 '12 at 22:00
  • And finally, the "rebuild and index the whole result set on every change" approach. I tried it on real data, just for fun. Quite understandably, takes on the order on tens of second on my development machine. – Fyodor Soikin Sep 04 '12 at 22:15
  • As for the vague philosophy about SQL management studio being "not enough" and "inefficient" (more formally known as "first measure, then optimize"), of course I know that, and of course I make sure that the tools I use really reflect the actual state of things. On "query changes alone may not be enough" - did you intend to say that I did not realize this? I myself proposed a change in schema in the form of six extra tables, did I not? – Fyodor Soikin Sep 04 '12 at 22:35
  • And lastly, on the notion of six tables being not enough for an indexed view, because there is also a window function. I do not need to encapsulate the window function in the indexed view. If I add those six tables, I will be able to index everything that goes BEFORE the window function. This will make the sorting instant (i.e. get rid if sorting altogether), which in turn will make the window function execute instantly as well. By the way, I have already explained this in the comments to the original post. – Fyodor Soikin Sep 04 '12 at 22:36
  • 1
    Wow, talk about a "Thank you" with an implied "for nothing" on the end. It's comment chains like the one above that made me include the warnings. You seem upset that anyone got any credit for helping you because they didn't magically make your query orders of magnitude faster. Sorry. When I said "schema/design issues" didn't mean you seemed unwilling to make some changes, I meant you seemed unwilling to consider that the real problem your expectations given your requirements of your "inheriting triplets with properties" _design_ as whole. – BitwiseMan Sep 04 '12 at 23:58
  • I had a problem that i could not solve satisfactorily on my own, and I asked people if they had a better solution than I did. I offered a substantial reward (as much as SO would let me) for a better solution. Nobody was able to come up with a better solution, the desire to help notwithstanding. I did provide detailed explanation why the provided solutions did not work. Some of the reward did go to a response that did not answer my question. Are you blaming me for not being grateful or happy now? Grateful for what and to whom? Happy about what? Do I not have every right to be upset? – Fyodor Soikin Sep 05 '12 at 00:38
0

You might speed this up by materializing the join in an indexed table, say joinedresult. This has the disadvantage of needing space and saving to disk. But it has the advantage of being able to use an index for the slow part.

insert into joinedresult
select Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,Objects1.[depthInTheTree] as O1D,Objects2.[depthInTheTree] as O2D,Objects3. depthInTheTree]  as O3D from  ... (see above)

Make sure joinedresult has an index on [O1,O2,O3,Property,O1D,O2D,O3D] and clear it before running. Then

select * from
(
    select 
    Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,
    row_number() over( 
        partition by Children1.Id, Children2.Id, Children3.Id, tp.Property
        order by O1D descending, O2D descending, O3D descending
    )
    as InheritancePriority
    from joinedresult
)
where InheritancePriority = 1
IvoTops
  • 3,463
  • 17
  • 18
  • 1
    The row_number() part is the problem. – edze Aug 21 '12 at 12:14
  • 1
    First, building the index is just as slow as sorting, or even slower. Second, I need this whole thing as an expression, not a procedure, so that it can be used as a part of other expressions. – Fyodor Soikin Aug 21 '12 at 12:55
0

Have you tried an index (or set the pk) with the "Value" column first, the "Property" column second, the "Object1" column third, the "Object2" column fourth, and "Object3" column fifth? I am assuming that "Value" is more restrictive than "Property".

I am also assuming that you have the Id column set as the primary key, and a foreign key relationship between ParentId and Id.

How does this query perform?:

    with 
    -- First, get all combinations that match the property/value pair.
    validTrip as (
        select Object1, Object2, Object3
        from TriplesAndProperties 
        where value = @value
            and property = @property
    ),
    -- Recursively flatten the inheritance hierarchy of Object1, 2 and 3.
    o1 as (
        select Id, 0 as InherLevel from Objects where Id in (select Object1 from validTrip)
        union all
        select rec.Id, InherLevel + 1 from Objects rec inner join o1 base on rec.Parent = base.[Object]
    ),
    o2 as (
        select Id, 0 as InherLevel from Objects where Id in (select Object2 from validTrip)
        union all
        select rec.Id, InherLevel + 1 from Objects rec inner join o2 base on rec.Parent = base.[Object]
    ),
    o3 as (
        select Id, 0 as InherLevel from Objects where Id in (select Object3 from validTrip)
        union all
        select rec.Id, InherLevel + 1 from Objects rec inner join o3 base on rec.Parent = base.[Object]
   )
    -- select the Id triple.
    select o1.Id, o2.Id, o3.Id N
    -- match every option in o1, with every option in o2, with every option in o3.
    from o1
        cross join o2
        cross join o3
    -- order by the inheritance level.
    order by o1.InherLevel, o2.InherLevel, o3.InherLevel;
Sako73
  • 9,957
  • 13
  • 57
  • 75
  • Yes, I have tried numerous combinations of keys, this particular one included. It is now evident for me, however (based on the execution plans), that the bottleneck is in sorting, and because of the sorting term, no indexes on any tables can help it. Just think about it. – Fyodor Soikin Aug 25 '12 at 01:18
  • Your query does not work, for two reasons. First reason: your query does not produce a desired result. Here is a counterexample. Say the TriplesAndProperties table contains two rows: (1,2,3,P1,V1) and (1,4,5,P1,V1), and @property = 'P1' and @value = 'V1', and none of the objects has a parent. Then your `validTrip` cte will contain two rows: (1,2,3) and (1,4,5). Then your `o1` cte will have two rows: (Id=1) and (Id=1). Your `o2` cte will have two rows: (Id=2) and (Id=4). And your `o3` cte will have two rows: (Id=3) and (Id=5). InherLevel will be zero in all of them. (continued) – Fyodor Soikin Aug 25 '12 at 01:22
  • Then your triple cross join will produce the following eight ( = two by two by two) rows: (1,2,3), (1,2,5), (1,4,3), (1,4,5), (1,2,3), (1,2,5), (1,4,3), (1,4,5). The desired result, on the other hand, should be exactly equal to your `validTrip` cte, because there is no impact from inheritance. In other words, you are producing all possible triples, whereas I only need the ones that are "children" of those from `validTrip`. – Fyodor Soikin Aug 25 '12 at 01:26
  • As a side note, I do not understand what is the purpose of sorting by inheritance level. I need to produce those triples, but I do not care what order they are in. – Fyodor Soikin Aug 25 '12 at 01:27
  • And the second reason is that your query is vastly inefficient because of the way recursive CTEs work in SQL Server: there is no magic mathematics to it, the server just blindly evaluates the whole table by recursively applying the expression. This is totally fine when the expression is top-level: the server simply returns the results to the client as they become available. But when you are trying to join the results, this requires the server to evaluate each expression in whole, then offload results to a temporary table before performing the join. A bit more complex than simply sorting. :-) – Fyodor Soikin Aug 25 '12 at 01:31
  • As far as the sorting, I must have misunderstood the requirements in your question. I thought you were asking to have them sorted in the order of distance from the original object. – Sako73 Aug 25 '12 at 03:19
  • I do not want them sorted by the distance from the original object. Instead, I need to select only ONE row for every original object, and that row must be the one with the smallest distance. That is where the sorting comes from. – Fyodor Soikin Aug 25 '12 at 15:58
0

Hierarchical queries, that is WITH RECURSIVE ... or proprietary equivalents such as CONNECT BY are your friend in this case.

The recipe to solve your particular problem is: Start at the leave and ascend to the root aggregating and excluding anything that's been found already.

Dude
  • 583
  • 2
  • 9
  • I don't understand how this is supposed to help. Are you sure you've read the question carefully? – Fyodor Soikin Aug 30 '12 at 21:45
  • Ooops. Well, now I did. You want about the inverse. Can't you just create a persistent flat view (that is an indexable table maintained by your triggers) of the structure (Obj,FromParent,Prop,Value) to get the answer from a trivial query? – Dude Aug 31 '12 at 10:13
  • Then I will have to maintain that table on updates manually, which will be a major source of bugs. In my book, this is even less preferable than keeping six copies of the objects table. – Fyodor Soikin Aug 31 '12 at 14:04
  • You can use triggers to maintain that table on updates. I have successfully used this strategy using a single flat table that reflects a hierarcical model. In my case the properties are permissions. As the initial implementation based on an SQL view dynamically traversing the hierarchy turned out to be a performance bottleneck, I decided to write a "cached and indexable view" based on a table. The overall system performance improved by factors as all reading queries can use index scans, now. – Dude Sep 01 '12 at 17:32
  • Six copies of the objects table is way simpler than keeping the whole thing. So I will stick with the six copies strategy for now. By the way, in my case properties are also permissions. Also, for efficient representation if trees in a relational database, check out this: http://blog.monstuff.com/archives/000026.html. Would have worked in your case. – Fyodor Soikin Sep 01 '12 at 19:26
  • 1
    Thanks for the interesting link, although it wouldn't have been an appropriate design choice in my case because of the required maintenance overhead. I'm very happy with fast-as-can-be permission checks and decent write performance as well. There is nothing wrong using six tables (or whatever number of tables you need for the model) - my point was: Rotate the problem by 180 degrees, that is perform the inheritance when the model is changed to cache the results in a table and get the fastest possible read access. The overhead when writing is probably tolerable and things get much simpler. – Dude Sep 02 '12 at 21:13
0

I am guessing that your table is fairly big. Hence the slowness. In that case I am also guessing that you have multiple properties (2 to many). I this case I would suggest you move the "where property= 'P1'" inside the CTE. This would filter a good portion of the data making your query as fast as up to the number of properties times faster.

Something like : http://sqlfiddle.com/#!3/7c7a0/92/0

Dumitrescu Bogdan
  • 7,127
  • 2
  • 23
  • 31
  • If you look at your execution plan, you'll see that sorting still takes 33% of the whole query. This happens because there are no indexes that could be used for sorting, and therefore, the server has to calculate the whole intermediate set first and then sort it. This operation depends on the number of rows being sorted more than linearly, which translates into the third power of the number of objects. In other words, you didn't solve the problem: the bottleneck is still there. – Fyodor Soikin Aug 30 '12 at 21:51
  • As for little local optimization tricks, such as filtering by the property first, - I've done them all and more. While they do reduce the complexity factor somewhat, the principal dependency on the third power of the number of objects stays in place, which leads to more slowness as the system grows. – Fyodor Soikin Aug 30 '12 at 21:53
0

Caching is the KEY to making a query faster. It reduces the calculations you must make. You want to create an index, because you want to CACHE, and save WORK. Below are two possibilities to do this.

Option 1

The SQL database sorts because of your windowing function. And you say the windowing function is too slow as it is.

I don't know how well this will work, but it might work.

Instead of sorting by a number of columns, you could try sorting by a single column - "closeness".

Let's define closeness as some abstract integer for now. Instead of your windowing function, you can instead have the following SQL:

select * from
(
    select
        Children1.Id as O1, Children2.Id as O2, Children3.Id as O3, tp.Property, tp.Value,

        row_number() over( 
            partition by Children1.Id, Children2.Id, Children3.Id, tp.Property
            order by closeness DESC
        )
        as InheritancePriority
    from
    ... (see above)
)
where InheritancePriority = 1

closeness can be a column defined in the TriplesAndProperties table. For each object, you can define its "closeness", as the distance it is away from the root node (O1). Then the we can define closeness(tuple) = closeness(Object1)*100+closeness(Object2)*10+closeness(Object3)

This way, the tuple with furtherest from the root is what you want.

To avoid sorting, you just have to make sure that closeness is indexed.


Option 2

I am VERY sure that this will work.

Define your TriplesAndProperties table to have the columns: Object1, Object2, Object3, Property, Value, Effective_Object1, Effective_Object2, Effective_Object3, Closeness.

Notice that here I also define closeness as a column.

When you insert/update a tuple into your table, (X,Y,Z), instead, you want to insert:

(X,Y,Z,Property,Value,X,Y,Z,0)
(X,Y,Z,Property,Value,X,Y,Z.child,1)
(X,Y,Z,Property,Value,X,Y,Z.grandchild,2)
(X,Y,Z,Property,Value,X,Y.child,Z,10)
(X,Y,Z,Property,Value,X,Y.child,Z.child,11)
(X,Y,Z,Property,Value,X,Y.child,Z.grandchild,12)
(X,Y,Z,Property,Value,X,Y.grandchild,Z,20)
(X,Y,Z,Property,Value,X,Y.grandchild,Z.child,21)
(X,Y,Z,Property,Value,X,Y.grandchild,Z.grandchild,22)
...
...

This means that instead of inserting/updating/destroying a single row in your table, you are inserting up to ~20 rows. That's not too bad.

Then your query is VERY EASY.

You just say:

SELECT * FROM
    (
    SELECT Effective_Object1, Effective_Object2, Effective_Object3, Property, Value,
        row_number() over( 
            partition by Effective_Object1, Effective_Object2, Effective_Object3, Property
            order by Closeness DESC
        ) AS InheritancePriority FROM TriplesAndProperties
     ) WHERE InheritancePriority = 1;

In this option, you have to make sure closeness is indexed, you can just index by the tuple (Effective_Object1, Effective_Object2, Effective_Object3, Property, Closeness).


In both cases, you have some amount of caching, which is data which doesn't add any additional information as such, but which caches a certain amount of calculation or work.

ronalchn
  • 12,225
  • 10
  • 51
  • 61