I am trying to calculate descriptive statistics for the birthweight data set (birthwt) from the MASS package in R. However, I'm only interested in a few variables: age, ftv, ptl and lwt.

This is the code I have so far:

library(MASS)
library(dplyr)
data("birthwt")

grouped <- group_by(birthwt, age, ftv, ptl, lwt)

summarise(grouped, 
          mean = mean(bwt),
          median = median(bwt),
          SD = sd(bwt))

It gives me a pretty-printed table, but only a few of the SD values are filled in and the rest say NA. I just can't work out why or how to fix it!

alistaire
Angus
  • Where does that go in the code? – Angus Jan 04 '18 at 03:13
  • The reason is that most of the groups contain only a single observation (check with `grouped %>% summarise(n = n())`), and `sd` needs more than one observation or else it returns NA – akrun Jan 04 '18 at 03:13
  • I'm sorry I don't understand! – Angus Jan 04 '18 at 03:17
  • You can check `?sd`. It says `The standard deviation of a length-one vector is NA.` Some of the groups have only one element. – akrun Jan 04 '18 at 03:18

2 Answers

I ended up here for a different reason, and in my case the answer also comes from the docs:

# BEWARE: reusing variables may lead to unexpected results
mtcars %>%
    group_by(cyl) %>%
    summarise(disp = mean(disp), sd = sd(disp))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#>     cyl  disp    sd
#>   <dbl> <dbl> <dbl>
#> 1     4  105.    NA
#> 2     6  183.    NA
#> 3     8  353.    NA

So, if you have the same problem as me, give the summaries new names instead of reusing an existing variable name:

mtcars %>%
    group_by(cyl) %>%
    summarise(
        disp_mean = mean(disp),
        disp_sd = sd(disp)
    )
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#>     cyl disp_mean disp_sd
#>   <dbl>     <dbl>   <dbl>
#> 1     4      105.    26.9
#> 2     6      183.    41.6
#> 3     8      353.    67.8
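
For anyone wondering why the reused name breaks things, here is a minimal sketch, assuming a reasonably recent dplyr. Inside `summarise()`, later expressions can see the columns created by earlier ones, so once `disp = mean(disp)` has run, `disp` refers to a single number per group and `sd()` of a single number is NA. The object names `reused` and `distinct_names` are only for illustration.

```r
library(dplyr)

# After `disp = mean(disp)`, the name `disp` refers to the per-group mean
# (one value per group), so sd(disp) is taken over a length-one vector.
reused <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

# With distinct output names, both summaries see the original column.
distinct_names <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp_mean = mean(disp), disp_sd = sd(disp))

all(is.na(reused$sd))           # TRUE: every SD is NA
anyNA(distinct_names$disp_sd)   # FALSE: every SD is computed
```

If you really need the original column name for the mean, computing the SD before overwriting the name should also work, but distinct names are harder to get wrong.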
teppo

Some of the groups contain only one row.

grouped %>% 
     summarise(n = n())
# A tibble: 179 x 5
# Groups: age, ftv, ptl [?]
#     age   ftv   ptl   lwt     n
#   <int> <int> <int> <int> <int>
# 1    14     0     0   135     1
# 2    14     0     1   101     1
# 3    14     2     0   100     1
# 4    15     0     0    98     1
# 5    15     0     0   110     1
# 6    15     0     0   115     1
# 7    16     0     0   110     1
# 8    16     0     0   112     1
# 9    16     0     0   135     2
#10    16     1     0    95     1

According to ?sd,

The standard deviation of a length-one vector is NA.

This results in NA for the SD of every group that contains only one element.
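
A quick way to see this for yourself is sketched below. It checks `sd()` on a single value and then counts how many of the groups are singletons. The bwt-like numbers and the name `group_sizes` are made up for illustration.

```r
library(MASS)    # provides the birthwt data set
library(dplyr)

# sd() divides by n - 1, so a single value has no defined standard deviation.
sd(2500)             # NA
sd(c(2500, 3100))    # a finite number

# Count the rows in each age/ftv/ptl/lwt combination.
group_sizes <- birthwt %>%
  count(age, ftv, ptl, lwt, name = "n")

# Number of groups with exactly one row; each of these gets SD = NA.
sum(group_sizes$n == 1)
```

If most groups turn out to be singletons, grouping by fewer variables is probably the more useful fix.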

akrun
  • How is there only one element if I have used 4 variables? – Angus Jan 04 '18 at 03:24
  • @Angus You are grouping `birthwt` by `age, ftv, ptl, lwt`, and some combinations of these variables occur only once. You may need to revisit which variables you want to group by; I think that is the problem here. It looks like `lwt` has unique values and could be omitted from the grouping – akrun Jan 04 '18 at 03:24
  • So I cannot actually get standard deviation values for those with NA? – Angus Jan 04 '18 at 03:26
  • @Angus There are no NA values. It is just that you have a single observation per group. If you want to replace them with some value, you can use an `if/else` condition, i.e. `summarise(grouped, mean=mean(bwt), median=median(bwt), SD= if(n()>1) sd(bwt) else 0)`, but I am not sure that makes sense – akrun Jan 04 '18 at 03:28
  • I'm running into the same problem although n() > 1. I'm thinking it's a bug. – Peter Straka May 19 '20 at 14:14