1

I have the following data

Date         value_idx
2002-01-31    .
2002-01-31   24.533
2002-01-31   26.50
2018-02-28   25.2124
2019-09-12   22.251
2019-01-31   24.214
2019-05-21   25.241
2019-05-21    .
2020-05-21   25.241
2020-05-21   23.232

I would need to calculate the average of value_idx of the last 3 years and 7 years. I tried first to calculate it as follows:

proc sql;
create table table1 as
select date, avg(value_idx) as avg_value_idx
from table
group by date;
quit;

The problem is that I do not know how to calculate the average of value_idx not per each month but for the last two years. So I think I should extract the year, group by that, and then calculate the average. I hope someone of you can help me with this.

  • 1
    Please explain what you want more clearly. Could you provide results for the sample data you posted. If you just want the average for the 7 years up to today just use a WHERE clause to filter to observations with date >= '01JAN2013'd . – Tom Nov 09 '20 at 16:40
  • I would need to consider the average of value_idx calculating it through the last two years so 2019 and 2018). –  Nov 09 '20 at 16:48
  • Do you just want two values for the whole dataset? If so why is your example attempted code including the DATE variable in the result set? If not then how many results do you want? – Tom Nov 09 '20 at 17:00
  • it should be just one value, as I should take the average for 2018 and 2019 (last two years), starting from the most recent year (that in the dataset is 2020). –  Nov 09 '20 at 17:08
  • This is possible in `PROC SQL`, but I discourage using that for things like this - if you have a reason that you must use `PROC SQL` specifically, amend the post to include that. – Joe Nov 09 '20 at 19:13

2 Answers2

1

The best way to do this sort of thing in SAS is with native PROCs, as they have a lot of functionality related to grouping.

In this case, we use multilabel formats to control the grouping. I assume you mean 'Last Three Years' as in calendar 2018/2019/2020 and 'Last Seven Years' as calendar 2014-2020. Presumably you can see how to modify this for other time periods - so long as you aren't trying to make the time period relative to each data point.

We create a format that uses the MULTILABEL option (which allows data points to fall in multiple categories), and the NOTSORTED option (to allow us to force the ordering of the labels, otherwise SEVEN is earlier than THREE).

Then, we use it in PROC TABULATE, enabling it with MLF (MultiLabel Format) and preloadfmt order=data which again keeps the ordering correct. This produces a report with the two averages only.

data have;
informat date yymmdd10.;
input Date value_idx;
datalines;
2002-01-31    .
2002-01-31   24.533
2002-01-31   26.50
2017-02-28   25.2124
2017-09-12   22.251
2018-01-31   24.214
2018-05-21   25.241
2019-05-21    .
2020-05-21   25.241
2020-05-21   23.232
;;;;
run;

proc format;
  value yeartabfmt (multilabel notsorted)
    '01JAN2018'd-'31DEC2020'd = 'Last Three Years'
    '01JAN2014'd-'31DEC2020'd = 'Last Seven Years'
     other=' '
   ;
quit;

proc tabulate data=have;
  class date/mlf preloadfmt order=data;
  var value_idx;
  format date yeartabfmt.;
  tables date,value_idx*mean;
run;
Joe
  • 62,789
  • 6
  • 49
  • 67
1

You can use CASE to decide which records contribute to which MEAN. You need to clarify what you mean by last 2 or last 7 years. This code will find the value of the maximum date and then compare the year of that date to the year of the other dates.

select 
 mean(case when year(max_date)-year(date) < 2 then value_idx else . end) as mean_yr2 
,mean(case when year(max_date)-year(date) < 7 then value_idx else . end) as mean_yr7
from have,(select max(date) as max_date from have)
;

Results

mean_yr2  mean_yr7
------------------
 24.0358   24.2319
Tom
  • 47,574
  • 2
  • 16
  • 29