SAS proc sql SELECT DISTINCT fails to remove duplicates on already sorted dataset

Question

Given the following test data:

data test;
input A B;
cards;
1 2
1 1
1 2
run;
NOTE: The data set WORK.TEST has 3 observations and 2 variables.

I am aware that proc sort can behave unexpectedly if you don't sort by a whole key, or even when you appear to sort by the whole key but there is a keep statement in force:

proc sort data=test out=test_dedup_works nodup;
    by a _all_;
run;
NOTE: There were 3 observations read from the data set WORK.TEST.
NOTE: Duplicate BY variable(s) specified. Duplicates will be ignored.
NOTE: 1 duplicate observations were deleted.
NOTE: The data set WORK.TEST_DEDUP_WORKS has 2 observations and 2 variables.

proc sort data=test out=test_dedup_fails nodup;
    by a;
run;
NOTE: There were 3 observations read from the data set WORK.TEST.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.TEST_DEDUP_FAILS has 3 observations and 2 variables.


proc sort data=test (keep=a) out=test_dedup_alsofails nodup;
    by a;
run;
NOTE: There were 3 observations read from the data set WORK.TEST.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.TEST_DEDUP_ALSOFAILS has 3 observations and 1 variables.

What's new to me is that trying to deduplicate the resulting not-actually-deduplicated dataset using PROC SQL also fails to remove duplicates:

proc sql;
    create table test_dedup_eventhisfails as
        select distinct a
        from test_dedup_alsofails;
quit;
NOTE: Table WORK.TEST_DEDUP_EVENTHISFAILS created, with 3 rows and 1 columns.

Is this a bug which is documented somewhere or am I doing something wrong?

I don't get that result in 9.3 or 9.4. What version of SAS are you using? I do match your `nodup` result just using a, but I consistently get one record in the SQL query in both versions. — Joe, May 14 '14 at 14:53
In any event, why use `nodup` when it seems you need `nodupkey` instead? — Joe, May 14 '14 at 14:54
I'm using SAS 9.1.3. If this is a bug that has been fixed, it would be useful to have a link to a SAS problem note or similar. I know I can work around the problem, but I want to understand it too. — jl6, May 14 '14 at 15:01
I tried, and also for me no problem. SAS 9.2 here. (Did you make sure that at the time of runnning the proc sql there was no other data in dataset test_dedup ? ) — Yoh, May 14 '14 at 15:17
I did a search for support tickets and didn't find any. You should probably contact support yourself if you're really concerned about this, they can likely tell you for sure. In any event I would just avoid using `nodup` (or `noduprecs`, which is the proper term) and use `nodupkey` - I've never found a good use for `nodup` in my career; it's very rare that it has value. — Joe, May 14 '14 at 15:20
The SAS documentation says "Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables." which appears to be consistent with your results. See also the SORTDUP= system option. — Amir, May 15 '14 at 11:47
I cannot replicatie the sql problem. Try running the proc sql; with _method and _tree options to see what execution SAS is attemptions (proc sql _method _tree; ... quit;). — Laurent de Walick, May 20 '14 at 09:11

SAS proc sql SELECT DISTINCT fails to remove duplicates on already sorted dataset

0 Answers0