2

Note: I have a working query, but am looking for optimisations to use it on large tables.

Suppose I have a table like this:

id  session_id  value
1       5           7
2       5           1
3       5           1
4       5           12
5       5           1
6       5           1
7       5           1
8       6           7
9       6           1
10      6           3
11      6           1
12      7           7
13      8           1
14      8           2
15      8           3

I want the id's of all rows with value 1 with one exception: skip groups with value 1 that directly follow a value 7 within the same session_id.

Basically I would look for groups of value 1 that directly follow a value 7, limited by the session_id, and ignore those groups. I then show all the remaining value 1 rows.

The desired output showing the id's:

5
6
7
11
13

I took some inspiration from this post and ended up with this code:

declare @req_data table (
    id int primary key identity,
    session_id int,
    value int
)

insert into @req_data(session_id, value) values (5, 7)
insert into @req_data(session_id, value) values (5, 1)  -- preceded by value 7 in same session, should be ignored
insert into @req_data(session_id, value) values (5, 1)  -- ignore this one too
insert into @req_data(session_id, value) values (5, 12)
insert into @req_data(session_id, value) values (5, 1)  -- preceded by value != 7, show this
insert into @req_data(session_id, value) values (5, 1)  -- show this too
insert into @req_data(session_id, value) values (5, 1)  -- show this too
insert into @req_data(session_id, value) values (6, 7)
insert into @req_data(session_id, value) values (6, 1)  -- preceded by value 7 in same session, should be ignored
insert into @req_data(session_id, value) values (6, 3)
insert into @req_data(session_id, value) values (6, 1)  -- preceded by value != 7, show this
insert into @req_data(session_id, value) values (7, 7)
insert into @req_data(session_id, value) values (8, 1)  -- new session_id, show this
insert into @req_data(session_id, value) values (8, 2)
insert into @req_data(session_id, value) values (8, 3)



select id
from (
    select session_id, id, max(skip) over (partition by grp) as 'skip'
    from (
        select tWithGroups.*,
            ( row_number() over (partition by session_id order by id) - row_number() over (partition by value order by id) ) as grp
        from (
            select session_id, id, value,
                case
                    when lag(value) over (partition by session_id order by session_id) = 7
                        then 1
                    else 0
                end as 'skip'
            from @req_data
        ) as  tWithGroups
    ) as tWithSkipField
    where tWithSkipField.value = 1
) as tYetAnotherOutput
where skip != 1
order by id

This gives the desired result, but with 4 select blocks I think it's way too inefficient to use on large tables.

Is there a cleaner, faster way to do this?

Nico Kempe
  • 93
  • 1
  • 7
  • Take a look at LAG: https://learn.microsoft.com/en-us/sql/t-sql/functions/lag-transact-sql You can look at the previous row. – Keith Sep 05 '17 at 13:07
  • They used `LAG` in the original query @Leonidas199x – S3S Sep 05 '17 at 13:08
  • 3
    I think this belongs on Code Review, not Stack Exchange. It is working code. – Tab Alleman Sep 05 '17 at 13:08
  • @TabAlleman if it does not scale to a practical data volume, then I might still consider it as errant. Like, if your algorithm is O(n!) to sort an array, I wouldn't consider that "working". This code looks to be O(n^2) for an apparent O(n) task, and we should be interested in an efficient set-based solution to avoid writing custom loops in T-SQL. – Elaskanator Nov 14 '19 at 19:27

3 Answers3

2

The following should work well for this.

WITH
    cte_ControlValue AS (
        SELECT 
            rd.id, rd.session_id, rd.value,
            ControlValue = ISNULL(CAST(SUBSTRING(MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id), 5, 4) AS INT), 999)
        FROM
            @req_data rd
            CROSS APPLY ( VALUES (CAST(rd.id AS BINARY(4)) + CAST(NULLIF(rd.value, 1) AS BINARY(4))) ) bv (BinVal)
        )
SELECT 
    cv.id, cv.session_id, cv.value
FROM
    cte_ControlValue cv
WHERE 
    cv.value = 1
    AND cv.ControlValue <> 7;

Results...

id          session_id  value
----------- ----------- -----------
5           5           1
6           5           1
7           5           1
11          6           1
13          8           1

Edit: How and why it works... The basic premise is taken from Itzik Ben-Gan's "The Last non NULL Puzzle".

Essentially, we are relying 2 different behaviors that most people don't usually think about...

1) NULL + anything = NULL. 2) You can CAST or CONVERT an INT into a fixed length BINARY data type and it will continue to sort as an INT (as opposed to sorting like a text string).

This is easier to see when the intermittent steps are added to the query in the CTE...

SELECT 
    rd.id, rd.session_id, rd.value, 
    bv.BinVal,
    SmearedBinVal = MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id),
    SecondHalfAsINT = CAST(SUBSTRING(MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id), 5, 4) AS INT),
    ControlValue = ISNULL(CAST(SUBSTRING(MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id), 5, 4) AS INT), 999)
FROM
    #req_data rd
    CROSS APPLY ( VALUES (CAST(rd.id AS BINARY(4)) + CAST(NULLIF(rd.value, 1) AS BINARY(4))) ) bv (BinVal)

Results...

id          session_id  value       BinVal             SmearedBinVal      SecondHalfAsINT ControlValue
----------- ----------- ----------- ------------------ ------------------ --------------- ------------
1           5           7           0x0000000100000007 0x0000000100000007 7               7
2           5           1           NULL               0x0000000100000007 7               7
3           5           1           NULL               0x0000000100000007 7               7
4           5           12          0x000000040000000C 0x000000040000000C 12              12
5           5           1           NULL               0x000000040000000C 12              12
6           5           1           NULL               0x000000040000000C 12              12
7           5           1           NULL               0x000000040000000C 12              12
8           6           7           0x0000000800000007 0x0000000800000007 7               7
9           6           1           NULL               0x0000000800000007 7               7
10          6           3           0x0000000A00000003 0x0000000A00000003 3               3
11          6           1           NULL               0x0000000A00000003 3               3
12          7           7           0x0000000C00000007 0x0000000C00000007 7               7
13          8           1           NULL               NULL               NULL            999
14          8           2           0x0000000E00000002 0x0000000E00000002 2               2
15          8           3           0x0000000F00000003 0x0000000F00000003 3               3

Looking at the BinVal column, we see an 8 byte hex value for all non-[value] = 1 rows and NULLS where [value] = 1... The 1st 4 bytes are the Id (used for ordering) and the 2nd 4 bytes are [value] (used to set the "previous non-1 value" or set the whole thing to NULL.

The 2nd step is to "smear" the non-NULL values into the NULLs using the window framed MAX function, partitioned by session_id and ordered by id.

The 3rd step is to parse out the last 4 bytes and convert them back to an INT data type (SecondHalfAsINT) and deal with any nulls that result from not having any non-1 preceding value (ControlValue).

Since we can't reference a windowed function in the WHERE clause, we have to throw the query into a CTE (a derived table would work just as well) so that we can use the new ControlValue in the where clause.

Jason A. Long
  • 4,382
  • 1
  • 12
  • 17
  • 2
    This is a clever solution, but it really needs an explanation. – GarethD Sep 05 '17 at 14:21
  • 2
    This is a variation of Itzik Ben-Gan's solution to the Last non-NULL puzzle. http://sqlmag.com/t-sql/last-non-null-puzzle... In this case the Idead is, we want all rows with a [value] = 1 but only if the preceding non-1 value isn't = 7... So, by treating the [value] = 1 as nulls, we can use Itzik's method to get the "last non-null" value... Once we have that, the final where clause become really easy to write. I'll go ahead and update the answer a better explanation as well. – Jason A. Long Sep 05 '17 at 14:40
  • A clever solution with a very helpful explanation, thanks. Performance-wise it seems to be in second place, but I'm no expert at measuring that. (I used "SET STATISTICS TIME ON" and "SET STATISTICS IO ON" to get some stats) – Nico Kempe Sep 07 '17 at 10:01
  • 1
    I'd be interest in seeing your test. Using a #temp table in place of a @table variable, so that statics can be created... with the original test data and no additional indexes. I'm coming up with the following... binary concat: total scan count = 20; total logical reads = 93; cpu time = 0 elapsed time = 00:00:00.001 /// gaps & islands: total scan count = 23; total logical reads = 156; cpu time = 0; elapsed time = 00:00:00.046 – Jason A. Long Sep 07 '17 at 13:16
  • adding a covering nonclustered index on session_id & id removes the sort (77% of the plan cost) from the binary concat solution, making it ~ 5 times faster. – Jason A. Long Sep 07 '17 at 13:21
0

You can use the following query:

select id, session_id, value,
          coalesce(sum(case when value <> 1 then 1 end) 
                   over (partition by session_id order by id), 0) as grp
from @req_data

to get:

id  session_id  value   grp
----------------------------
1   5           7       1
2   5           1       1
3   5           1       1
4   5           12      2
5   5           1       2
6   5           1       2
7   5           1       2
8   6           7       1
9   6           1       1
10  6           3       2
11  6           1       2
12  7           7       1
13  8           1       0
14  8           2       1
15  8           3       2

So, this query detects islands of consecutive 1 records that belong to the same group, as specified by the first preceding row with value <> 1.

You can use a window function once more to detect all 7 islands. If you wrap this in a second cte, then you can finally get the desired result by filtering out all 7 islands:

;with session_islands as (
   select id, session_id, value,
          coalesce(sum(case when value <> 1 then 1 end) 
                   over (partition by session_id order by id), 0) as grp
   from @req_data
), islands_with_7 as (
   select id, grp, value,
          count(case when value = 7 then 1 end) 
          over (partition by session_id, grp) as cnt_7
   from session_islands
)
select id 
from islands_with_7
where cnt_7 = 0 and value = 1
Giorgos Betsos
  • 71,379
  • 9
  • 63
  • 98
  • @NicoKempe Glad I was able to help and welcome to Stack Overflow. Please mark this, or any other, answer as accepted if it helped you solve your problem. – Giorgos Betsos Sep 07 '17 at 09:59
  • This works, though it takes about 15ms time on a table with 1000 records. For comparison, other scripts take 3-12ms. According to the actual execution plan, there are two sort actions taking 32% each. EDIT: Sorry, I got my comments the wrong way around.. – Nico Kempe Sep 07 '17 at 12:58
0
SELECT CRow.id
FROM @req_data AS CRow
CROSS APPLY (SELECT MAX(id) AS id FROM @req_data PRev WHERE PRev.Id < CRow.id  AND PRev.session_id = CRow.session_id AND  PRev.value <> 1 ) MaxPRow
LEFT JOIN @req_data AS PRow ON MaxPRow.id = PRow.id
WHERE CRow.value = 1 AND ISNULL(PRow.value,1) <> 7
pblack
  • 778
  • 1
  • 6
  • 13
  • This seems to be the solution with the lowest impact on the server. So far I haven't been able to find any gaps in the output, so that's good. :) EDIT: Sorry, I got my comments the wrong way around.. – Nico Kempe Sep 07 '17 at 12:55