How can I sum the unique values in a variable by the observations in another variable?

Question

I have a dataset containing information on hospital visits for certain years. One of the variables contains county codes for counties that the visits occurred in and another variable contains the year for each visit that occurred. What I'm trying to do is sum how many unique counties hospital visits occurred in per year. I have access to SAS code which does this and I'm trying to duplicate it somehow in R.

The SAS code is:

proc sql;
    select ED_VISIT_YEAR, count(distinct COUNTY_CD) as Counties from dataset
    group by ED_VISIT_YEAR;
quit; run;

I have tried using the aggregate and unique functions along with sum, but I always get errors.

The only thing I have got to work is this:

Check <- as.matrix(unique(ED_a$COUNTY_CD,ED_a$ED_VISIT_YEAR))

Which produces outcome like this:

what I'm looking for is something like this:

ED_VISIT_YEAR  Counties
2005              16
2006              16
2007              16
2008              16

I'm still fairly new to R and this has me stumped so any help would be greatly appreciated.

can you provide a reproducible code example (with counties variables, years and so on)? — Carbo, Jan 29 '20 at 16:43

iod · Accepted Answer · 2020-01-29T16:58:37.860

1

Using dplyr:

ED_a %>% group_by(ED_VISIT_YEAR) %>% tally(name="counties")

If you don't want to use dplyr, a base solution could be this:

countysummaries<-as.data.frame(counties=sapply(split(ED_a,ED_a$ED_VISIT_YEAR),nrow)

Note that this will give you the years as rownames rather than a column, so you might want to do this afterwards:

countysummaries$ED_VISIT_YEAR<-rownames(countysummaries)

(the tidyverse also has rownames_to_column, but here we're looking for a base only solution).

edited Jan 29 '20 at 16:58

answered Jan 29 '20 at 16:44

iod

7,412
2
17
36

Thank you. Using your dplyr example put me on the right track. The only thing different I did was for the output I'm looking for I added the distinct function into the line of code. Like this: ED_a %>% group_by(ED_VISIT_YEAR) %>% distinct(COUNTY_CD) %>% tally(name="Counties) – Tj Pierce Jan 29 '20 at 17:57

score 0 · Answer 2 · answered Jan 29 '20 at 17:16

In SQL, to replicate COUNT(DISTINCT ...) for SQL dialects that do not support this type of aggregate such as MS Access SQL, you have to first return the distinct records at one level, then count returned rows at the final level:

SELECT Year, Count(*) As distinct_counties
FROM (SELECT DISTINCT Year, CountyCD
      FROM myTable) sub
GROUP BY Year

Therefore in base R, run aggregate after unique call (cbind to rename column):

aggregate(cbind(Counties=COUNTY_CD) ~ ED_VISIT_YEAR, 
          data = unique(ED_a$COUNTY_CD, ED_a$ED_VISIT_YEAR), 
          FUN = length)

How can I sum the unique values in a variable by the observations in another variable?

2 Answers2