Why is the spread function in R adding NAs to my grouped and summarised dataframe?

Question

I'm applying the spread function from the tidyr package to a grouped dataset. Below is an example of my actual data. I can summarize the data using group_by and the summarise function in dplyr, which yields a small dataframe with no NAs. However, if I reformat the dataframe using spread, cells that once had no NAs suddenly have NAs. I tried ungrouping first, but that did not help.

I must be missing something really simple.

In the code below, table.ratings will have no NAs, but new.table does.

# Sample data
year <- c(0,0,0,0,1,1,1,1,2,2,2,2,
          0,0,0,0,1,1,1,1,2,2,2,2)
question <- c('a','b','c','d','a','b','c','d','a','b','c','d',
              'a','b','c','d','a','b','c','d','a','b','c','d')
rating <- c(3.22, 4.33, 2.33, 2.33, NA, 3.00, 
           2.33, 3.00, 5.00, 3.67, 4.33, NA,
           5.00, 4.10, 3.67, 2.82, 5.00, 5.00, 
           2.33, 3.00, 2.33, 5.00, 2.33, 3.00)
version <- c(1,2,1,2,1,2,1,2,1,2,1,2,
             1,2,1,2,1,2,1,2,1,2,1,2)

df <- data.frame(year, version, question, rating)

# Grouped dataframe with no NAs
table.ratings <- df %>%
  select(year, question, rating, version) %>%
  group_by(year, version, question) %>%
  dplyr::summarise(count = n(), 
        `Average Score` = mean(rating, na.rm = TRUE))

# This simply moves the cells around but results in NAs
new.table <- spread(table.ratings, key = "question", 
                    value = `Average Score`)

Can you show what output you'd expect? In `table.ratings` look for the value of `Average Score` where `year == 0`, `version == 1` and `question == b` (for example). — markus, Sep 05 '18 at 19:53
The first row of `new.table` has `year` as 0, `version` as 1, and `count` as 2. If we look at the original data with those values, `filter(table.ratings, year == 0 & version == 1 & count == 2)`, we see that there are 2 rows, corresponding to questions `a` and `c`. But in your spread data, we have columns for questions `a`, `b`, `c` and `d`. Since you don't have rows in the original data for that year/version/count combo with questions `b` and `d`, they are filled in with `NA`. The "why" is combinations exist in the wide data that do not exist in the long data. — Gregor Thomas, Sep 05 '18 at 19:53
It's not cells that exist in `table.ratings` get converted to `NA` values---new cells created in `new.table` don't have values in `table.ratings`, so they get filled with `NA`. You can use the `fill` argument if you'd rather have the default be something else. — Gregor Thomas, Sep 05 '18 at 19:54
Thank you for your comments. I should have added that in my actual data, there are more than 100,000 rows, and there is no combination of year, question, and version that has all NAs. When I group and summarise the data, there are averages for certain combinations that appear in the dataframe. Every combination is represented, and there are no NAs. Yet, when I spread that dataframe, new rows appear in the new table for those same combinations but with NAs. What I was hoping for is just a rearranging of the cells in table.ratings as if I were cutting and pasting while also transposing the data. — EmotionResearcher, Sep 05 '18 at 21:48
Oh wow, it was count! Once I removed `count = n()` from the code, it worked fine. Facepalm! — EmotionResearcher, Sep 05 '18 at 21:57

Why is the spread function in R adding NAs to my grouped and summarised dataframe?

0 Answers0