2

I am dealing with a large dataset (30 million rows) and I need to pull the most recent three dates (which may have an indeterminate number of rows attached to them) so like 03MAR2016 might have 2 rows 27FEB2016 might have ten and 25FEB2016 might have 3. How do I say "Select everything that falls within the last X number of values in this set regardless of how many rows there are"?

Kumachorou
  • 37
  • 3
  • it will depend on your database structure, what have you tried so far? – ArturoAP Mar 10 '16 at 15:35
  • This is extremely hard to do in native SAS SQL. I would suggest that you use a data step, with `retain`. – Gordon Linoff Mar 10 '16 at 15:41
  • @GordonLinoff I imagine a data step wouldn't be very good at this either, unless you used a DoW loop, which seems like overkill? – Joe Mar 10 '16 at 15:48
  • @ArturoAP I've tried limiting the number of outputs in descending order and I can get a somewhat hamfisted result that gives me more data than I need but getting to the actual result I need is proving to be a real challenge. Also please pardon if I am not giving a good answer, I am new to this – Kumachorou Mar 10 '16 at 16:11

2 Answers2

1

You need to break this down into two tasks.

  1. Determine which dates are the last three dates
  2. Pull all rows from those dates

Both are possible in SQL, though the first is much easier using other methods (SAS's SQL isn't very good at getting the "first X things").

I would suggest using something like PROC FREQ or PROC TABULATE to generate the list of dates (just a PROC FREQ on the date variable), really any proc you're comfortable with - even PROC SORT would work (though that's probably less efficient). Then once you have that table, limit it to the three highest observations, and then you can use it in a SQL step to join to the main table and filter to those three dates - or you can use other options, like creating a custom format or hash tables or whatever works for you. 30 million rows isn't so many that a SQL join should be a problem, though, I'd think.

Joe
  • 62,789
  • 6
  • 49
  • 67
1

As you can not sort in an in-line view/subquery you will have to split your SQL statement in two parts:

  1. Sort the date DESCENDING and get the distinct values
  2. Join back to the original data and limit to first 3

But as stated before, SQL is not good at this kind of operation.

DATA input_data ; 
  INPUT date value ; 
CARDS ; 
20160101 1
20160101 2
20160101 3 
20160102 1 
20160103 1
20160104 1
20160105 1
20160105 2
20160105 3
; 

proc sql _method;
create table DATE_ID as
    select distinct DATE
        from input_data
            order by DATE DESC;

create table output_data as
    select data.*
        from (select *
                from DATE_ID
                where monotonic() <= 3
                    ) id
        inner join input_data data
            on id.DATE = data.DATE
        ;
quit;
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Jetzler
  • 787
  • 3
  • 11