Proc hpbin with minimum proportion per bin

Question

I am using PROC HPBIN to split my data into equal-width buckets, i.e. each bucket covers an equal share of the variable's total range.

My issue arises with extremely skewed data that has a large range: almost all of my data points fall into one bucket, while a few observations are scattered around the extremes.

Is there a way to force PROC HPBIN to take the proportion of values in each bin into account, so that every bin holds at least, say, 5% of the observations and the remaining sparse bins are grouped together?

DATA var1;
    DO VAR1 = 1 TO 100;
        OUTPUT;
    END;
    DO VAR1 = 500 TO 505;
        OUTPUT;
    END;
    DO VAR1 = 7000 TO 7015;
        OUTPUT;
    END;
    DO VAR1 = 1000000 TO 1000010;
        OUTPUT;
    END;
RUN;

/*Use proc hpbin to generate bins of equal width*/
ODS EXCLUDE ALL;
ODS OUTPUT
    Mapping = bin_width_results;
PROC HPBIN
    DATA=var1
    numbin = 15
    bucket;
    input VAR1 / numbin = 15;
RUN;
ODS EXCLUDE NONE;

I'd like to see a way for PROC HPBIN (or some other method) to group the empty bins together so that each bucket holds at least 5% of the observations. However, I am not looking to use percentiles here (they are already another plot in my PDF) because I'd like to see the spread.

Would 1 bin containing 100% of the data qualify as *at least 5%*? – Richard, Apr 15 '19 at 13:34

Richard · Answer 1 · 2019-04-16T12:43:22.530

The QUANTILE option with 20 bins should give you roughly 5% of the observations per bin:

PROC HPBIN DATA=var1 quantile;
    input VAR1 / numbin = 20;
RUN;

When the values in a bin need to be dynamically rebinned because the bin holds an overly high proportion of the data (a problem bin), you need to run HPBIN again on only the values in those problem bins. A macro can be written to loop around the HPBIN process, zooming in on the problem areas.

For example:

DATA have;
    DO VAR1 = 1 TO 100;
        OUTPUT;
    END;
    DO VAR1 = 500 TO 505;
        OUTPUT;
    END;
    DO VAR1 = 7000 TO 7015;
        OUTPUT;
    END;
    DO VAR1 = 1000000 TO 1000010;
        OUTPUT;
    END;
RUN;

%macro bin_zoomer (data=, var=, nbins=, rezoom=0.25, zoomlimit=8, out=);

  %local data_view step nextstep outbins zoomers;

  proc sql;
    create view data_zoom1 as
    select 1 as step, &var from &data;
  quit;

  %let step = 1;
  %let data_view = data_zoom&step;
  %let outbins = bins_step&step;

%bin:
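  /* stop at the zoom limit; step back so &step names the last step actually binned */
  %if &step > &zoomlimit %then %do;
    %let step = %eval(&step - 1);
    %goto done;
  %end;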

  ODS EXCLUDE ALL;
  ODS OUTPUT Mapping = &outbins;
  PROC HPBIN DATA=&data_view bucket ;
    id step;
    input &var / numbin = &nbins;
  RUN;
  ODS EXCLUDE NONE;

  proc sql noprint;
    select count(*) into :zoomers trimmed
    from &outbins
    where proportion >= &rezoom
  ;
  quit;

  %put NOTE: &=zoomers;

  %if &zoomers = 0 %then %goto done;

  %let step = %eval(&step+1);

  proc sql;
    create view data_zoom&step as
    select &step as step, *
    from &data_view data
    join &outbins   bins
    on data.&var between bins.LB and bins.UB
       and bins.proportion >= &rezoom
    ;
  quit;

  %let outbins = bins_step&step;
  %let data_view = data_zoom&step;

  %goto bin;

%done:

  %put NOTE: done @ &=step;

  /* stack the bins that are non-problem or of final zoom;
     the LB to UB domains from step2+ will discretely cover the bounds
     of the original step1 bins */
  data &out;
    set 
      bins_step1-bins_step&step
      indsname = source
    ;
    /* INDSNAME= returns the name as uppercase LIBREF.MEMBER, so compare the member part */
    if proportion < &rezoom
       or scan(source, -1, '.') = upcase("bins_step&step");
    step = source;
  run;

%mend;

options mprint;

%bin_zoomer(data=have, var=var1, nbins=15, out=bins);

I imagine this is the same as percentiles (quantiles giving an equal-ish number of observations per bin). My thinking is that I still want an equal width per bin; it's just that the extremes are grouped together, so that I essentially zoom in on the bin holding 90% of the observations, i.e. I want to see the distribution of the observations within that bin. – 78282219, Apr 15 '19 at 14:37
If you zoom in on a particular domain to see the distribution within a bin, the bin size changes. If you have a fixed number of bins, the proportion in each bin will vary, and you can read that proportion from the output data. You would need a second step to identify the domains (bins) you want to zoom into, plus additional criteria for how to break those bins into finer detail, again with either a fixed number of bins or quantiles. – Richard, Apr 15 '19 at 16:06
I thought so; I will get to work on this. I am reading your answer and am going to test it out! – 78282219, Apr 16 '19 at 08:34

score 1 · Answer 2 · answered Apr 15 '19 at 15:59

Have you tried using the WINSOR method (winsorised binning)? From the documentation:

Winsorized binning is similar to bucket binning except that both tails are cut off to obtain a smooth binning result. This technique is often used to remove outliers during the data preparation stage.

You can specify WINSORRATE= to control the tail percentage it uses when adjusting these tails.
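As a rough sketch of what that could look like on the question's sample data set (var1), assuming the WINSOR and WINSORRATE= options go on the PROC HPBIN statement: the rate of 0.1 and the output name bin_winsor_results are illustrative choices only, not values from the question. Whether this helps depends on how heavy the tails are. In the sample data 11 of the 133 observations sit near 1,000,000, so a small rate such as 0.05 would probably not cut them off.

```sas
/* Sketch only: winsorized binning on the question's sample data.       */
/* WINSORRATE=0.1 is an assumed, illustrative value; tune it to the      */
/* data. bin_winsor_results is a hypothetical output data set name.      */
ODS OUTPUT Mapping = bin_winsor_results;
PROC HPBIN DATA=var1 WINSOR WINSORRATE=0.1;
    input VAR1 / numbin = 15;
RUN;

/* Inspect the resulting bin ranges and the proportion in each bin */
PROC PRINT DATA=bin_winsor_results;
RUN;
```

Checking the Proportion column of the Mapping output shows whether any bin still dominates, in which case a larger rate or the zooming approach from the other answer may be needed.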

answered Apr 15 '19 at 15:59

Joe


I will look into this! I am already removing 5% of the data by trimming the percentiles. However, some of my factors are ratios which become unstable when the denominator is small – 78282219 Apr 16 '19 at 08:35
This makes it more a data quality issue but difficult to manage – 78282219 Apr 16 '19 at 08:35
