I am dealing with a large dataset (30 million rows) and I need to pull the most recent three dates (which may have an indeterminate number of rows attached to them) so like 03MAR2016 might have 2 rows 27FEB2016 might have ten and 25FEB2016 might have 3. How do I say "Select everything that falls within the last X number of values in this set regardless of how many rows there are"?
-
it will depend on your database structure, what have you tried so far? – ArturoAP Mar 10 '16 at 15:35
-
This is extremely hard to do in native SAS SQL. I would suggest that you use a data step, with `retain`. – Gordon Linoff Mar 10 '16 at 15:41
-
@GordonLinoff I imagine a data step wouldn't be very good at this either, unless you used a DoW loop, which seems like overkill? – Joe Mar 10 '16 at 15:48
-
@ArturoAP I've tried limiting the number of outputs in descending order and I can get a somewhat hamfisted result that gives me more data than I need but getting to the actual result I need is proving to be a real challenge. Also please pardon if I am not giving a good answer, I am new to this – Kumachorou Mar 10 '16 at 16:11
2 Answers
You need to break this down into two tasks.
- Determine which dates are the last three dates
- Pull all rows from those dates
Both are possible in SQL, though the first is much easier using other methods (SAS's SQL isn't very good at getting the "first X things").
I would suggest using something like PROC FREQ
or PROC TABULATE
to generate the list of dates (just a PROC FREQ on the date variable), really any proc you're comfortable with - even PROC SORT
would work (though that's probably less efficient). Then once you have that table, limit it to the three highest observations, and then you can use it in a SQL step to join to the main table and filter to those three dates - or you can use other options, like creating a custom format or hash tables or whatever works for you. 30 million rows isn't so many that a SQL join should be a problem, though, I'd think.

- 62,789
- 6
- 49
- 67
As you can not sort in an in-line view/subquery you will have to split your SQL statement in two parts:
- Sort the date DESCENDING and get the distinct values
- Join back to the original data and limit to first 3
But as stated before, SQL is not good at this kind of operation.
DATA input_data ;
INPUT date value ;
CARDS ;
20160101 1
20160101 2
20160101 3
20160102 1
20160103 1
20160104 1
20160105 1
20160105 2
20160105 3
;
proc sql _method;
create table DATE_ID as
select distinct DATE
from input_data
order by DATE DESC;
create table output_data as
select data.*
from (select *
from DATE_ID
where monotonic() <= 3
) id
inner join input_data data
on id.DATE = data.DATE
;
quit;