How to use summarize and group by to create a summary table in r?

Question

So I have the following data set (this is a small sample/example of what it looks like, with the original being 7k rows and 30 columns over 7 decades):

Year,Location,Population Total, Median Age, Household Total
2000, Adak, 220, 45, 67
2000, Akiachak, 567, NA, 98
2000, Rainfall, 2, NA, 11
1990, Adak, NA, 33, 56
1990, Akiachak, 456, NA, 446
1990, Tioga, 446, NA, NA

I want to create a summary table that indicates how many years of data is available by location for each variable. So something like this would work (for the small example from before):

Location,Population Total, Median Age, Household Total
Adak,1,2,2
Akiachak,2,0,2
Rainfall,1,0,1
Tioga,1,0,0

I'm new to R and haven't used these two commands together so I'm unsure of the syntax. Any help would be wonderful or alternatives.

You'll want to use the `dplyr` package. Start with `group_by(Location)` and then write your `summarise` or `count` statement. I suggest looking into a tutorial on using `dplyr` though. — Dave Gruenewald, Oct 04 '17 at 17:53
Also, you'll receive much better feedback on questions here if you supply an example of approaches you have tried. As is, some might interpret this as a homework question... — Dave Gruenewald, Oct 04 '17 at 17:54
Thank you! I'm thinking of switching everything that isnt NA to 1, Na to 0 and then aggregating the rows while taking away the years. But my boss is saying I should try using these two functions together. — jules, Oct 04 '17 at 18:06
Are you talking specifically about `group_by` and `summarize` in `dplyr`? If so, you might want to add `dplyr` as a tag — acylam, Oct 04 '17 at 18:10

tbradley · Answer 1 · 2017-10-04T18:08:03.010

1

You can do something like this:

x %>%
  group_by(Location) %>%
  summarise(count_years = n(), 
            count_pop_total = sum(!is.na(Population_Total)),
            count_median_age = sum(!is.na(Median_Age)),
            count_house_total = sum(!is.na(Household_Total)))

where you can replace the mean with whatever operation you want to perform. You should take a look at the dplyr vignette for more general solutions.

edited Oct 04 '17 at 18:08

answered Oct 04 '17 at 17:56

tbradley

2,210
11
20

I changed my answer to give what OP wanted using explicitly `group_by` and `summarise`, however, the answer given by @useR with `summarize_all` or `summarize_at` are probably better for this situation – tbradley Oct 04 '17 at 18:09

acylam · Accepted Answer · 2017-10-04T18:05:15.847

A solution with summarize_all from dplyr:

library(dplyr)
df %>%
  group_by(Location) %>%
  summarize_all(funs(sum(!is.na(.)))) %>%
  select(-Year)

Or you can use summarize_at:

df %>%
  group_by(Location) %>%
  summarize_at(vars(-Year), funs(sum(!is.na(.))))

Result:

# A tibble: 4 x 4
   Location PopulationTotal MedianAge HouseholdTotal
      <chr>           <int>     <int>          <int>
1      Adak               1         2              2
2  Akiachak               2         0              2
3  Rainfall               1         0              1
4     Tioga               1         0              0

Data:

df = read.table(text = "Year,Location,PopulationTotal, MedianAge, HouseholdTotal
                2000, Adak, 220, 45, 67
                2000, Akiachak, 567, NA, 98
                2000, Rainfall, 2, NA, 11
                1990, Adak, NA, 33, 56
                1990, Akiachak, 456, NA, 446
                1990, Tioga, 446, NA, NA", header = TRUE, sep = ",", stringsAsFactors = FALSE)

library(dplyr)    
df = df %>%
  mutate_at(vars(PopulationTotal:HouseholdTotal), as.numeric)

How to use summarize and group by to create a summary table in r?

2 Answers2