0

I have table in SAS with missing values like below:

col1 | col2 | col3 | ... | coln
-----|------|------|-----|-------
111  |      | abc  | ... | abc
222  | 11   | C1   | ... | 11
333  | 18   |      | ... | 12
...  | ...  | ...  | ... | ...

And I need to delete from above table variables where is more than 80% missing values (>=80%).

How can I do taht in SAS ?

dingaro
  • 2,156
  • 9
  • 29
  • Check out this answer from Reeza: https://communities.sas.com/t5/SAS-Programming/Calculate-percent-missing-in-long-dataset/td-p/499995 – Stu Sztukowski Dec 01 '22 at 23:56
  • Stu Sztukowski I saw this post but it does not work for me, I do not know why, but it generate only 1 row in output, maybe do you have other solution ? :) – dingaro Dec 02 '22 at 00:04
  • Are all of the variables you want to potentially drop numeric? Or do you also need to test some that are character strings? – Tom Dec 02 '22 at 01:11

1 Answers1

2

The macro below will create a macro variable named &drop_vars that holds a list of variables to drop from your dataset that exceed missing threshold. This works for both character and numeric variables. If you have a ton of them then this macro will fail but it can easily be modified to handle any number of variables. You can save and reuse this macro.

%macro get_missing_vars(lib=, dsn=, threshold=);
    %global drop_vars;

    /* Generate a select statement that calculates the proportion missing:
       nmiss(var1)/count(*) as var1, nmiss(var2)/count(*) as var2, ... */   
    proc sql noprint;
        select cat('nmiss(', strip(name), ')/count(*) as ', strip(name) )
        into :calculate_pct_missing separated by ','
        from dictionary.columns
        where     libname = upcase("&lib")
              AND memname = upcase("&dsn")
        ;
    quit;

    /* Calculate the percent missing */
    proc sql;
        create table pct_missing as
            select &calculate_pct_missing.
            from &lib..&dsn.
        ;
    quit;

    /* Convert to a long table */
    proc transpose data=pct_missing out=drop_list;
        var _NUMERIC_;
    run;

    /* Get a list of variables to drop that are >= the drop threshold */
    proc sql noprint;
        select _NAME_
        into :drop_vars separated by ' '
        from drop_list
        where COL1 GE &threshold.
        ;
    quit;
%mend;

It has three parameters:

lib: Library of your dataset

dsn: Dataset name without the library

threshold: Proportion of missing values a variable must meet or exceed to be dropped

For example, let's generate some sample data and use this. col1 col2 col3 all have 80% missing values.

data have;
    array col[10];
    do i = 1 to 10;
        do j = 1 to 10;
            col[j] = i;
            if(i > 2 AND j in(1, 2, 3) ) then col[j] = .;
        end;
        
        output;
    end;

    drop i j;
run; 

We'll run the macro and check the log:

%get_missing_vars(lib=work, dsn=have, threshold=0.8);

%put &drop_vars;

The log shows:

col1 col2 col3

Now we can pass this into a simple data step.

data want;
    set have;
    drop &drop_vars;
run;
Stu Sztukowski
  • 10,597
  • 1
  • 12
  • 21
  • Nice work! I also suggest to use `%if %str(&drop_vars) %then %do; ...; %end;` to surround `drop ...` statement. – whymath Dec 02 '22 at 07:27