1

I would need to filter a dataset based on the higher value of var1 per each group.

I have this dataset:

Var1  t  avg
AA   1  0.02
AA   0  0.21
BB   1  0.05
BB   0  0.20
CC   1  0.10
CC   0  0.14

Built as follows

Proc sql;
    Select 
    Var1,
    t,
    Avg(var2) as avg
    From my_data
    Group by 1,2
    Order by 1,2; quit;

My expected output would be

    AA   0  0.21
    BB   0  0.20
    CC   0  0.14

Could you please give me some tips on how to do it? I think I should do something to rank them first, then select the observations having rank value = 1 (if in descending order)

  • 1
    I think you mean the higher value of the average . . . and the example is a little misleading because all the time values are `0`. – Gordon Linoff Sep 01 '20 at 14:42

3 Answers3

0

This is rather tricky in proc sql. I think this will do what you want:

proc sql;
    select Var1, t, avg(avg_var2) as average
    from my_data d
    group by 1
    having avg(avg_var2) = (select max(average)
                            from (select avg(avg_var2) as average
                                  from my_data d2
                                  where d2.var2 = d.var
                                  group by var1, t
                                 )
                           );
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks Gordon. I have got however no values. Did you get the same results? –  Sep 01 '20 at 14:59
  • @Math . . . This might be better done with a hybrid solution or saving the aggregated data to a dataset and then using that to get the max. – Gordon Linoff Sep 01 '20 at 15:25
0

From your input, add a next step using one of the approaches below.

You can use SQL with the HAVING clause to get the max.

proc sql;
create table want as
select var1, t, avg
from have 
group by var1
having avg=max(avg);
quit;

Or, assuming no duplicates are possible in the average value you can easily use a PROC SORT.

proc sort data=have;
by var1 descending avg;
run;

proc sort data=have out=want_highest nodupkey;
by var1;
run;
Reeza
  • 20,510
  • 4
  • 21
  • 38
0

You can compute the averages for each id,t group in a sub-query and from that select highest average over id group.

Example:

data raw;
  call streaminit(123);
  do id = 1 to 100;
    do t = 0 to rand('integer',1,2);
      do rep = 1 to rand('integer', 20);
        y = round(rand('uniform'), 0.01);
        output;
      end;
    end;
  end;
run;

proc sql;
  create table want as
  select id, t, avg
  from (
    select id, t, mean(y) as avg format=5.2
    from raw
    group by id, t
  ) as averages
  group by id
  having avg = max(avg)
  ;

Richard
  • 25,390
  • 3
  • 25
  • 38