2

I have a large dataset named Planes with missing values in Arrival Delays(Arr_Delay).I want to Replace those missing values by Average delay on the Specific route(Origin - Dest) by Specific Carrier.

Hereby is the sample of the dataset : -

date      carrier  Flight    tailnum    origin  dest  Distance Air_time Arr_Delay

01-01-2013  UA     1545       N14228    EWR      IAH    1400    227        17
01-01-2013  UA     1714       N24211    LGA      IAH    1416    227        .
01-01-2013  AA     1141       N619AA    JFK      MIA    1089    160        .
01-01-2013  EV     5708       N829AS    LGA      IAD    229     53        -18
01-01-2013  B6     79         N593JB    JFK      MCO    944     140        14
01-01-2013  AA     301        N3ALAA    LGA      ORD    733     138        .
01-01-2013  B6     49         N793JB    JFK      PBI    1028    149        .
01-01-2013  B6     71         N657JB    JFK      TPA    1005    158        19
01-01-2013  UA     194        N29129    JFK      LAX    2475    345        23
01-01-2013  UA     1124       N53441    EWR      SFO    2565    361       -29

code I tried : -

 Proc stdize data=cs1.Planes reponly method=mean out=cs1.Complete_data;
 var Arrival_delay_minutes;                                                                                                                                                                                                                                     
 Run; 

But as my problem states..i want to get the mean by Specific Route and Specific Carrier for the Missing Value. Please help me on this!

Aman kashyap
  • 143
  • 1
  • 3
  • 12
  • 1
    The technical term for *replace missing values* is [imputation](https://en.wikipedia.org/wiki/Imputation_(statistics)). Procedures that can deal with group processing (also known as `BY`) processing will have a `BY` statement available for use in coding. – Richard May 09 '18 at 11:51

2 Answers2

1

stdize Procedure does not have a way to include by or class variables. you can use the below code to complete your task:-

Proc means data=cs1.Planes  noprint;
 var Arr_Delay;
 class carrier origin dest;
 output out=mean1;
Run;

proc sort data=cs1.Planes;
  by carrier origin dest;
run;

proc sort data=mean1;
  by carrier origin dest;
run;

data cs1.Complete_data(drop=Arr_Delay1 _stat_);
  merge cs1.Planes(in=a) mean1(where=(_stat_="MEAN")
                    keep=carrier origin dest Arr_Delay _stat_
                    rename=(Arr_Delay = Arr_Delay1) in=b);
  by carrier origin dest;
  if a;
  if Arr_Delay =. then Arr_Delay=Arr_Delay1;
run;
kawsleo
  • 560
  • 4
  • 23
1

You just need to sort the table cs1.Planes by origin, dest & carrier before running Proc stdize and add by origin dest carrier; to do the grouping you wanted. The only case the values will remain missing is when there are no other values for this carrier/route.

You can find the SAS documentation here and available options here.

Code:

data have;
informat date ddmmyy10.;
format date ddmmyy10.;
input 
date      carrier $ Flight    tailnum $   origin $  dest $  Distance Air_time Arr_Delay;
datalines;
01-01-2013  UA     1545       N14228    EWR      IAH    1400    227        17
01-01-2013  UA     1714       N24211    LGA      IAH    1416    227        .
01-01-2013  AA     1141       N619AA    JFK      MIA    1089    160        .
01-01-2013  EV     5708       N829AS    LGA      IAD    229     53        -18
01-01-2013  B6     79         N593JB    JFK      MCO    944     140        14
01-01-2013  AA     301        N3ALAA    LGA      ORD    733     138        .
01-01-2013  B6     49         N793JB    JFK      PBI    1028    149        .
01-01-2013  B6     49         N793JB    JFK      PBI    1028    149        15
01-01-2013  B6     71         N657JB    JFK      TPA    1005    158        19
01-01-2013  UA     194        N29129    JFK      LAX    2475    345        23
01-01-2013  UA     1124       N53441    EWR      SFO    2565    361       -29
;
run;

proc sort data=work.have; by origin  dest carrier; run;

Proc stdize data=work.have reponly method=mean out=work.Complete_data ;
 var Arr_Delay;  
 by  origin  dest carrier ;
 Run; 
momo1644
  • 1,769
  • 9
  • 25