SQL Self Join to create Header and Line level columns

Question

I've a table with the following data.

Item number	Date	Reference	DocNumber	Qty	Site
PLU-1000	4-JAN	Header	JRN - 523	2	SFT
RIN-000138	4-JAN	Line	JRN - 523	2	SFT
RIN-000096	4-JAN	Line	JRN - 523	6	SFT
RIN-000247	4-JAN	Line	JRN - 523	3	SFT
PLU-1001	4-JAN	Header	JRN - 523	5	SFT
RIN-000789	4-JAN	Line	JRN - 523	15	SFT
RIN-000001	4-JAN	Line	JRN - 523	10	SFT
RIN-000247	4-JAN	Line	JRN - 523	5	SFT
RIN-000031	4-JAN	Line	JRN - 523	2.5	SFT

I'm trying to create a table in the following format assuming items are ordered and each 'RIN' item is related to former 'PLU' item:

HeaderItem	Line Item	Date	Reference	DocNumber	Qty	Site
PLU-1000	RIN-000138	4-JAN	Line	JRN - 523	2	SFT
PLU-1000	RIN-000096	4-JAN	Line	JRN - 523	6	SFT
PLU-1000	RIN-000247	4-JAN	Line	JRN - 523	3	SFT
PLU-1001	RIN-000789	4-JAN	Line	JRN - 523	15	SFT
PLU-1001	RIN-000001	4-JAN	Line	JRN - 523	10	SFT
PLU-1001	RIN-000247	4-JAN	Line	JRN - 523	5	SFT
PLU-1001	RIN-000031	4-JAN	Line	JRN - 523	2.5	SFT

I have to remove the original header row and add the item number in front of line items repeatedly until the next 'PLU' item comes.

Is there an Id or other PK key column in the source set? We need to know the sort order to give you the result. The Create table script for the source table will remove ambiguity — Chris Schaller, Jan 11 '22 at 05:59
There isn't any PK column as this is a transactional table but I've got another column for each transaction named 'Lot ID'. Please see below image: https://ibb.co/MfmrvYj — Umer Shamshad, Jan 11 '22 at 06:34
There should usually be a link field for such lists. That is, the list should be such that the child's records can be accessed based on the parent's ID. There is not even an option in your list to sort through which the list can be sorted and then the next records from each parent can be obtained. Is there an option to sort? For example, is record time recorded in minutes and seconds in your database? — AliNajafZadeh, Jan 11 '22 at 06:48
@UmerShamshad having a sort key is imperative for this type of dataset in MS SQL Server. Please update your post to include a column that can be used to determine the correct sort order, like this `Lot ID` you have identified in your posted image. It is dangerous for us as a community to leave this example that does not have an obvious sort order. Dangerous because we don't want you to learn bad habits or solutions that are not pracitcal in the commercial world. — Chris Schaller, Jan 12 '22 at 08:20

score 3 · Answer 1 · answered Jan 11 '22 at 08:17

3

This type of analysis is often referred to as Gap and Island analysis. This solution is similar to that of @Saeed EmamYari, but takes advantage of the natural key Lot ID you have identified in the comments about the real data set. We need a real value in the dataset that can be used to maintain the sort order, we can't use a virtual ROW_NUMBER based off a faked column value as this can lead to unpredictable results.

The only other major difference in this solution is that a CASE statement is used to create the HeaderItemNumber column, instead of a self-join, this should result in a cleaner execution plan.

WITH BoundaryData as (
    SELECT CASE Reference WHEN 'Header' THEN [Item number] END as HeaderItemNumber, *
    FROM Items 
)
, GroupedData as (
    SELECT COUNT(HeaderItemNumber) OVER (ORDER BY [Lot ID]) as HeaderGroup, *
    FROM BoundaryData
)
, HeaderedData as (
    SELECT MAX(HeaderItemNumber) OVER (PARTITION BY HeaderGroup) as HeaderItem, *
    FROM GroupedData
)
SELECT HeaderItem, [Item number], Date, Reference, DocNumber, Qty, Site
FROM HeaderedData
WHERE Reference <> 'Header'
ORDER BY [Lot ID]

I've captured this in a fiddle: http://sqlfiddle.com/#!18/d792dd/3
NOTE: In the fiddle I have generated Lot ID as an identity column, the value is not relevant, only the sequence.

answered Jan 11 '22 at 08:17

Chris Schaller

13,704
3
43
81

1

If there was an ID field, I would definitely use this field, but it did not question this option. Your solution is also interesting :) – Jan 11 '22 at 11:16
If this is a transactional table then there will be a date or some other field that can be used to determine the order, even if you have chosen not to use an incrementing PK. Putting a fake `ROW_NUMBER` like in the other answer is not guaranteed to maintain the correct sequence over time it is _very_ bad habit to get into and should be avoided. The image you posted has a valid _natural_ key that can also be used to determine the sequence, so this answer is based on that and is better suited to commercial software and real world solutions. – Chris Schaller Jan 11 '22 at 13:16
1

I have answered the question, if he had put the ID field, it would have been much easier, but my answer is appropriate for the question. Anyway, thank you. – Jan 12 '22 at 05:34
@SaeedEmamYari with respect, an answer like that compromises the integrity of the query. Tell OP to provide an adequate sorting key (which they did in the comments) Lesser developers will copy your answer and use that in a production scenario without realising the impact of doing so. It is both wrong and dangerous to circumvent the checks that SQL Server has built into the runtime to prevent you from being so careless. If you do not agree or understand this then I do not think you are qualified to continue this discussion. – Chris Schaller Jan 12 '22 at 08:16
I am not against this. But there is an additional field in your solution that is not in the question. If this field existed in the question, it would be much easier than your query. – Jan 12 '22 at 08:23
The second line of the question also states that the table data is assumed to be in order – Jan 12 '22 at 08:45
Yes, OP asked a _very silly_ question, but they will not learn anything useful by blindly giving them what they ask. In code and data we never ass-u-me, it makes an "_Ass_ out of _U_ and _Me_". In MS SQL Server the order of results if not specified is governed by the position of the records in the clustered index, which may very well be fragmented. So especially when we use window query functions where the order is significant like `COUNT` or `ROW_NUMBER` an `ORDER BY` clause is required. You could substitute your order by hack into this solution, but I wouldn't publish it against my name. – Chris Schaller Jan 12 '22 at 12:09
Anyway I saw this question was positive. If he puts in the ID field and asks a question, I will still try to give the most appropriate answer. Thank you for paying attention to people's learning. – Jan 12 '22 at 12:21
Thank you everyone for your support. I know I should have put LotID field in the question which specifies the order of transactions as I wasn't aware of its significance. Anyways, thank you again for your kind help. – Umer Shamshad Jan 12 '22 at 20:12

score 1 · Answer 2 · 2022-01-11T06:45:51.733

Use this code :

WITH PartitionedDataPoints2 AS
    (
    select *
           ,c=COUNT(ItemNumber) OVER (ORDER BY rn)
    from    
         (SELECT t2.ItemNumber
            ,t1.Date
            ,t1.ItemNumber as SItemNumber
            ,t1.Reference
            ,t1.DocNumber
            ,t1.Qty
            ,t1.Site
            ,ROW_NUMBER() over (order by (select 1)) as rn
       FROM [Test].[dbo].[Sheet1]  t1
       left join (select ItemNumber from [Test].[dbo].[Sheet1] where Reference='Header') t2 
       on t1.ItemNumber= t2.ItemNumber)t1
    ),
        GroupedDataPoints AS
    (
        SELECT c, s=MAX(ItemNumber)
        FROM PartitionedDataPoints2
        GROUP BY c
    )

    SELECT  
         a.s  as HeaderItem
        ,b.Date  
        ,SItemNumber as LineItem    
        ,Reference
        ,DocNumber
        ,Qty
        ,Site
    FROM GroupedDataPoints a
    JOIN PartitionedDataPoints2 b ON a.c = b.c and Reference='Line'

Tip : [Test].[dbo].[Sheet1] is my source table

Output:

HeaderItem  LineItem    Date    Reference   DocNumber   Qty Site
PLU-1000    RIN-000138  2022-01-04  Line    JRN - 523   2   SFT
PLU-1000    RIN-000096  2022-01-04  Line    JRN - 523   6   SFT
PLU-1000    RIN-000247  2022-01-04  Line    JRN - 523   3   SFT
PLU-1001    RIN-000789  2022-01-04  Line    JRN - 523   15  SFT
PLU-1001    RIN-000001  2022-01-04  Line    JRN - 523   10  SFT
PLU-1001    RIN-000247  2022-01-04  Line    JRN - 523   5   SFT
PLU-1001    RIN-000031  2022-01-04  Line    JRN - 523   2.5 SFT

Description : first created a table that returns the following values:

ItemNumber  Date    SItemNumber Reference   DocNumber   Qty Site    rn
PLU-1000    2022-01-04  PLU-1000    Header  JRN - 523   2   SFT 1
NULL    2022-01-04  RIN-000138  Line    JRN - 523   2   SFT 2
NULL    2022-01-04  RIN-000096  Line    JRN - 523   6   SFT 3
NULL    2022-01-04  RIN-000247  Line    JRN - 523   3   SFT 4
PLU-1001    2022-01-04  PLU-1001    Header  JRN - 523   5   SFT 5
NULL    2022-01-04  RIN-000789  Line    JRN - 523   15  SFT 6
NULL    2022-01-04  RIN-000001  Line    JRN - 523   10  SFT 7
NULL    2022-01-04  RIN-000247  Line    JRN - 523   5   SFT 8
NULL    2022-01-04  RIN-000031  Line    JRN - 523   2.5 SFT 9

Then I filled in the null values using the corresponding codes and displayed.

`over (order by (select 1))` is _volatile_ and should be avoided. The whole reason that SQL Server forces you to include an `ORDER BY` clause in window queries is to ensure that the results are determinate , in this example you would have achieved the same result from `ORDER BY Site` and it would still produce an ambiguous result. We can't blindly assume that the records will be stored and retrieved in the sequence we see them today, the order needs to be explicitly defined in the query for it to make sense. — Chris Schaller, Jan 11 '22 at 13:24
I just found a valid use-case for your _order by_ hack: https://stackoverflow.com/a/70709715/1690217 in this case it is a valid use because no SQL internals can alter the order of the results. When we query from a table or view the order is _indeterminate_ so that is why we MUST NOT use it to solve OP's issue in this post. — Chris Schaller, Jan 14 '22 at 11:19

SQL Self Join to create Header and Line level columns

2 Answers2