1

I have a table with about 700 million rows, which have the below sample for only one line_id.

LINE_ID|COLLECTION_DATE    |DSL_CARD_TYPE|
-------|-------------------|-------------|
1234567|2020-03-25 08:46:08|ADSL_PORT    |
1234567|2020-03-26 08:31:48|ADSL_PORT    |
1234567|2020-03-27 08:42:40|VDSL_PORT    |
1234567|2020-03-28 08:36:32|VDSL_PORT    |
1234567|2020-03-29 08:31:33|VDSL_PORT    |
1234567|2020-03-30 08:50:15|VDSL_PORT    |
1234567|2020-04-31 08:44:33|ADSL_PORT    |
1234567|2020-03-01 08:34:53|ADSL_PORT    |
1234567|2020-04-02 08:44:11|ADSL_PORT    |
1234567|2020-04-03 08:43:51|VDSL_PORT    |
1234567|2020-04-04 08:54:33|ADSL_PORT    |
1234567|2020-04-05 09:06:47|ADSL_PORT    |
1234567|2020-04-06 09:06:57|VDSL_PORT    |
1234567|2020-04-07 09:13:32|VDSL_PORT    |

What I need is to group DSL_CARD_TYPE and create a new column called Next_COLLECTION_DATE to get the next DSL_CARD_TYPE like below

LINE_ID|COLLECTION_DATE    |Next_COLLECTION_DATE  |DSL_CARD_TYPE|
-------|-------------------|----------------------|-------------|
1234567|2020-03-25 08:46:08|2020-03-26 08:31:48   |ADSL_PORT    |  
1234567|2020-03-27 08:42:40|2020-03-30 08:50:15   |VDSL_PORT    |
1234567|2020-03-31 08:34:53|2020-04-02 08:44:11   |ADSL_PORT    |
1234567|2020-04-03 08:43:51|2020-04-03 08:43:51   |VDSL_PORT    |   
1234567|2020-04-04 08:54:33|2020-04-05 09:06:47   |ADSL_PORT    |  
1234567|2020-04-06 09:06:57|2020-04-07 09:13:32   |VDSL_PORT    | 
  

I have created a very dummy and complex query to do the job, but with this huge amount of data it takes hours

COALESCE (lead (COLLECTION_DATE) OVER (PARTITION BY Line_ID ORDER BY COLLECTION_DATE),NOW() )Next_Collection_Date,
DSL_CARD_TYPE 
FROM (
SELECT * FROM (
SELECT
    LINE_ID, COLLECTION_DATE, 
    DSL_CARD_TYPE , 
    lead (DSL_CARD_TYPE) OVER (PARTITION BY Line_ID ORDER BY COLLECTION_DATE) To_Sync_Port, 
    lag (DSL_CARD_TYPE) OVER (PARTITION BY Line_ID ORDER BY COLLECTION_DATE) B_Sync_Port
FROM
    ANALYTICS.tmp.V_PORTS_LINE_CARD_DATA_ALL
    WHERE SYNC_PORT <>  TO_SYNC_PORT OR B_Sync_Port IS NULL )abc2```

1 Answers1

1

This looks like a gaps-and-islands problem, which in this case is probably best solved using the difference of row numbers:

select line_id, dsl_card_type, min(collection_date), max(collection_date)
from (select v.*,
             row_number() over (partition by line_id order by collection_date) as seqnum,
             row_number() over (partition by line_id, dsl_card_type order by collection_date) as seqnum_2
      from ANALYTICS.tmp.V_PORTS_LINE_CARD_DATA_ALL v
      where collection_date >= '2020-07-27 00:00:00'
     ) v
group by line_id, dsl_card_type, (seqnum - seqnum_2);

It is a little tricky to explain how this works. If you run the subquery, you can see how the difference between the two row numbers defines the adjacent rows with the same card type.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • please there is a chance for missing the collection date in the first table so, we could set the next_collection_date to be the first date of the next DSL_CARD_TYPE, not the last date of the adjacent DSL_CARD_TYPE. the first row is like the below: Line_ID: 1234567 >> COLLECTION_DATE: 2020-03-25 08:46:08 >> Next_COLLECTION_DATE: 2020-03-27 08:42:40 >> DSL_CARD_TYPE: ADSL_PORT – Ahmed Mohammed Abdel Kader Sep 09 '20 at 11:23
  • 1
    @AhmedMohammedAbdelKader . . . That would be a different question. This question is specifically designed to return the last date in each group. Ask a *new* question (and be sure to explain the difference!). – Gordon Linoff Sep 09 '20 at 12:53