0

In case it matters for the following will use a Netezza backend + SPSS Modeler and/or Advanced Query Tool for the query itself. I have no access to the CLI. I'm trying to understand if cursors and walking through a sorted table is necessary to handle the following:

Imagine a table with 2 columns, the first is a non-unique ID and the second is a date. Any given ID may occur in the table multiple times with one or more dates.

My goal is to select from this table rows for which the dates are spaced out by no less than a fixed number of days, say 90. For example:

| ID |  DATE      |
===================
  X    2014-01-01
  X    2014-02-01
  X    2014-07-01
  Y    2014-02-01
  Y    2014-06-01
  Y    2014-07-01

In the above example, the rows I want to select for X would be Jan 1 and Jul 1 (exclude Feb 1 because it is less than 90 days from Jan 1) and the rows for Y would be Feb 1 and Jun 1 (exclude Jul 1 because it's within 90 days of a prior case.

In practice there could be well over 100M rows in the table. Is it possible to do this without cursors? What would the optimum method be?

Thanks in advance for any advice!

EDIT: Expanded the test table data here. SQL Fiddle

In the above edited example, the desired output would be

| ID |  DATE      |
===================
  X    2014-01-01
  X    2014-04-01
  X    2014-10-01
  Y    2014-01-15
  Y    2014-04-15
  Y    2014-10-15
  Z    2014-01-01
  Z    2014-04-01
  Z    2014-10-01
funkyb
  • 17
  • 3
  • So when you say spaced out by no less than 90 days, that is 90 days from the lowest date for a particular ID? – shree.pat18 Jul 31 '14 at 02:55
  • What if you have a row X 2014-10-01, for example? What rows would you want to be selected in this case? – Nizam Jul 31 '14 at 02:55
  • Good questions - Shree.pat18) yes in the example I posted i'd like to start from the lowest for a given ID but in actual implementation we might need to do the opposite and start from the highest. I just picked one direction for purpose of this example. Nizam) If there was a row X 2014-10-01 then the rows selected with ID=X would be Jan 1, Jul 1 and Oct 1, as Oct 1 is more than 90 days from 2014-07-01. – funkyb Jul 31 '14 at 03:02
  • I think you could remove sql-server and mysql tags. Will you accept answer that only works in sql-server for example? Using CTE recursions, CROSS APPLY, etc...? – Nizam Jul 31 '14 at 04:09
  • Is it possible to use variable tables (sql-server)? – Nizam Jul 31 '14 at 04:10
  • If you have 1-1-14, 2-1-14, 3-1-14, 4-1-14, 5-1-14 and 10-1-14, which records should be selected? Can you give a definite answer? – Kevin Cook Jul 31 '14 at 16:10

2 Answers2

-1

If you accept something that works in SQL Server, then the following code will work:

With CTE as (
    select A.ID, A.DATA, MIN(B.DATA) DATA1 
    from Table1 A
    inner join Table1 B
      on A.ID = B.ID
      and DATEADD(DAY, 90, A.DATA) <= B.DATA
    GROUP BY A.ID, A.DATA
), REC AS (
   SELECT ID, MIN(DATA) DATA
   FROM Table1
   GROUP BY ID
   UNION ALL
   SELECT A.ID, B.DATA1
   FROM REC A
   INNER JOIN CTE B
     ON A.ID = B.ID
     AND A.DATA = B.DATA
)

SELECT *
FROM REC
ORDER BY ID, DATA

It user recursion of CTE. By choosing the minimum date for each ID, it follows by recursion taking always the minimum date which is bigger than 90 days. But this will only works in SQL Server.

SQL Fiddle

UPDATE

As you are getting ideas to implement somewhere else, it is interesting to have more than one way to implement it. In SQL Server is also possible to implement this way in TSQL:

DECLARE @TABLE1 TABLE (ID VARCHAR(1), DATA DATE, DATA1 DATE)

  INSERT INTO @TABLE1
    select A.ID, A.DATA, MIN(B.DATA) DATA1 
    from (
       SELECT ID, MIN(DATA) DATA
       FROM Table1
       GROUP BY ID
    ) A
    inner join Table1 B
      on A.ID = B.ID
      and DATEADD(DAY, 90, A.DATA) <= B.DATA
    GROUP BY A.ID, A.DATA

DECLARE @AUX INT = 0

WHILE (SELECT COUNT(*) FROM @TABLE1) <> @AUX
BEGIN

  SELECT @AUX = COUNT(*) FROM @TABLE1

  INSERT INTO @TABLE1
    select *
    from ( 
       select A.ID, A.DATA, MIN(B.DATA) DATA1 
       from @Table1 A
       inner join Table1 B
         on A.ID = B.ID
         and DATEADD(DAY, 90, A.DATA1) <= B.DATA
       GROUP BY A.ID, A.DATA    
    ) A
    where not exists (
        SELECT 1
        FROM @TABLE1
        WHERE ID = A.ID
         AND DATA = A.DATA
         AND DATA1 = A.DATA1
    )

END

SELECT ID, DATA
FROM @TABLE1
UNION
SELECT ID, DATA1
FROM @TABLE1
UNION
SELECT ID, MIN(DATA) DATA
FROM Table1
GROUP BY ID

SQL Fiddle

Nizam
  • 4,569
  • 3
  • 43
  • 60
  • Thanks for the feedback. I'm rather new to netezza and having something to go on in any implementation is at least a place to start...I'm checking to the NZSQL docs now to see if it support anything similar. – funkyb Jul 31 '14 at 05:01
  • The first set of sql code in your answer doesn't find gaps, it only finds records more than 90 days apart, if you pass records of 1-1-2014, 2-1-2014, 3-1-2014, 4-1-2014, 5-1-2014 it will flag 1-1-2014 and 5-1-2014 even though the gap between 4-1 and 5-1 is less than 90 days. – Kevin Cook Jul 31 '14 at 13:56
  • The second set of your code has the same problem, does not handle gaps at all. – Kevin Cook Jul 31 '14 at 14:03
  • It will flag 2014-01-01 and 2014-04-01, just as OP specified (see http://sqlfiddle.com/#!3/4a5691/1) – Nizam Jul 31 '14 at 14:05
  • I'm working on it. So far it looks like NZSQL doesn't support recursion via CTE but I agree in the SQLFiddle provided that solution does seem to produce the correct results. Analyzing the second solution now and will report back as soon as I can. I'd vote but I've never used SE before (created for this post) and am below the 15 noob-check. – funkyb Jul 31 '14 at 14:34
  • Thanks. To vote, IF YOU WANT, you just need to click the up arrow or the down arrow near the reputation of the question. If you think an answer solved your problem, you just need to click the check near the question reputation. – Nizam Jul 31 '14 at 14:39
  • I tried that already, SO response "vote requires 15 reputation". That said, I don't know why you're at negative numbers -> your solution does appear to produce the desired result and I'm just trying to make replicate it in NZSQL. The other solution posted by someone else does not produce the correct result. I'll be back - just waiting on closing this until I've found a method for NZSQL. If I can't, I'll close it anyway selecting yours are the solved one as it does model the correct process. – funkyb Jul 31 '14 at 14:45
  • So is the op looking for gaps in the records of > 90 days or just any records 90 days apart starting from the beginning? Because I look at the initial post and it seems he was looking to break up the data based on a 90 day gap. – Kevin Cook Jul 31 '14 at 16:06
  • @KevinCook I would resume the algorithm this way for a specific ID: -> 1. Find the minimum date / 2 -> from this date, find the minimum date >90 days / 3 -> from date found in 2, find the minimum date >90 days / 4 -> keep going... – Nizam Jul 31 '14 at 17:26
  • @kevincook - it's not between any two records, the order of dates matters. psuedologic: Identify first date for given ID, keep that date as "SEED", continue through dates for that ID sequentually and throw out all subsequent dates until you encounter a date >= SEED+90. Once you hit a date for the same id that meets this condition, keep that date and reset SEED to be this new date. Now repeat until you finish that ID; then repeat for next ID, and so on. – funkyb Jul 31 '14 at 17:27
  • Sadly I've determined that this doesn't seem to work in NZSQL short of using the FOR loops in a procedure w/ NZPLSQL (and I'm banging on the DBA's door to see if I can get auth to do that). Nothing in SPSS modeler v15 or the access I have via AQT supports it (and NZSQL does not support WITH statements with self-referencing selects). I've click the answer checkbox because the posted solutions do work on platforms that support them, and am seeking to get SPSS v16 which has support for recursion. – funkyb Jul 31 '14 at 23:31
-1

I found an SPSS/Netezza native iterative solution that was successful. SPSS supports the @OFFSET(Field, integer) function that can read ahead or behind. I had attempted to use this function previously but encountered errors related to recursion when attempting to use "field" equal to the same field the result of the function goes along with a negative integer to read the previous result.

Today working on another project I discovered the documentation for @OFFSET() is poor, and while I had believe it to function from first-to-last row order w/ positive integers representing read-ahead, that is actually backwards. It actually is solved last-row-to-first and positive integer offsets actually imply reading previous rows. A retry using my original method and correcting for the sign on the integer offset eliminated the recursion error and solved the problem.

The actual solution can be though of in this way. This description is overly verbose for the sake of clarity, in practice most of these items can be condensed into the same step, and it ignores the possibility of an ID having multiple instances of the same date (which isn't hard to handle using extra logic in the comparison but wasn't necessary for my needs).

  1. Select a sorted set on ID then Date.
  2. Define a new column PR_ID by @OFFSET(ID, 1) to store the previous row's ID
  3. Define a new column called LST_CNT_DT (last countable date) containing the current row's date if PR_ID <> ID. Otherwise, compare the difference in days between the current row's date and @OFFSET(LST_CNT_DT,1) [i.e. the prior row's value for the same field], if the difference is >=90, store the current row's date. Otherwise store @OFFSET(LST_CNT_DT,1).
  4. Select from this new set all rows where LST_CNT_DT = DT (where DT is date for current row).

Not quite as elegant as the CTE recursion method available in MsSQL, but can be built entirely in SPSS v15 and it executed on the full table in a few minutes.

funkyb
  • 17
  • 3