I have a dataset like so

data test;
    do i = 1 to 100;
        x1 = ceil(ranuni(0) * 100);
        x2 = floor(ranuni(0) * 1600);
        x3 = ceil(ranuni(0) * 1500);
        x4 = ceil(ranuni(0) * 1100);
        x5 = floor(ranuni(0) * 10);
        output;
    end;
run;

data test_2;
    set test;

    if mod(x1,3) = 0 then x1 = .;
    if mod(x2,13) = 0 then x2 = .;
    if mod(x3,7) = 0 then x3 = .;
    if mod(x4,6) = 0 then x4 = .;
    if mod(x5,2) = 0 then x5 = .;
    drop i;
run;

I plan to calculate a number of percentiles including two non-standard percentiles (2.5th and 97.5th). I do this using proc stdize as below

PROC STDIZE 
    DATA=test_2
    OUT=_NULL_
    NOMISS 
    PCTLMTD=ORD_STAT
    pctldef=3
    OUTSTAT=STDLONGPCTLS
    pctlpts=(2.5 5 25 50 75 95 97.5);
    VAR _NUMERIC_;
RUN;

Comparing to proc means

DATA TEST_MEANS;
    SET TEST_2;
    IF NOT MISSING(X1);
    IF NOT MISSING(X2);
    IF NOT MISSING(X3);
    IF NOT MISSING(X4);
    IF NOT MISSING(X5);
RUN;

PROC MEANS 
    DATA=TEST_MEANS NOPRINT; 
    VAR _NUMERIC_;
    OUTPUT OUT=MEANSWIDEPCTLS P5= P25= P50= P75= P95= / AUTONAME;
RUN;

However, when I compare the PROC STDIZE results above with the results produced in Excel and PROC MEANS, they don't align. I suspect it has something to do with how SAS treats missing values as -inf. Can someone confirm which would be correct?

78282219
  • This type of thing occurs when you have a lot of ties. How many ties are in your data and how do they impact the results? – Reeza Nov 18 '18 at 19:28
  • And percentiles don't have a 'common' definition, so there isn't a right one; there's only what's appropriate for your data and which one you want to use. Pick one and make it clear which definition you used. – Reeza Nov 18 '18 at 19:29

1 Answer

You are using PCTLDEF=3 in PROC STDIZE but the default definition in PROC MEANS, which is 5. I tested your code with PCTLDEF=3 in PROC MEANS and got matching results.
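
For reference, a minimal sketch of the PROC MEANS call with the percentile definition overridden to match the PROC STDIZE step in the question. QNTLDEF= is the PROC MEANS statement option (PCTLDEF= is accepted as an alias); the output dataset name MEANSWIDEPCTLS_DEF3 is made up here for illustration and is not from the original post.

```sas
/* Sketch only: same PROC MEANS step as in the question, but with     */
/* percentile definition 3 requested instead of the default (5).      */
/* MEANSWIDEPCTLS_DEF3 is a hypothetical output dataset name.         */
PROC MEANS
    DATA=TEST_MEANS NOPRINT QNTLDEF=3;
    VAR _NUMERIC_;
    OUTPUT OUT=MEANSWIDEPCTLS_DEF3 P5= P25= P50= P75= P95= / AUTONAME;
RUN;
```

Definitions 3 and 5 only disagree when n*p is a whole number: definition 3 returns the order statistic x(j), while definition 5 returns the average of x(j) and x(j+1). With a small sample and widely spaced values, that average can sit well away from x(j).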

data _null_
  • Thanks for the clarification. What concerns me is the difference between the options (over 20 units in some cases). How can I trust this output, and how can I tell which is most accurate? – 78282219 Nov 18 '18 at 16:54
  • @78282219 Just look at the definition of the algorithms used by the different options of the PCTLDEF= option and pick the one that you want to use. – Tom Nov 19 '18 at 04:12
  • I find that pctldef=5 replicates proc univariate, and I am content with that. – 78282219 Nov 19 '18 at 15:24