R: Cleaning up a wide and untidy dataframe

Question

I have a data frame that looks like:

d<-data.frame(id=(1:9), 
                  grp_id=(c(rep(1,3), rep(2,3), rep(3,3))), 
                  a=rep(NA, 9), 
                  b=c("No", rep(NA, 3), "Yes", rep(NA, 4)), 
                  c=c(rep(NA,2), "No", rep(NA,6)), 
                  d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)), 
                  e=c(rep(NA, 7), "No", NA), 
                  f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
>d
  id grp_id  a    b    c    d    e    f
1  1      1 NA   No <NA> <NA> <NA> <NA>
2  2      1 NA <NA> <NA> <NA> <NA>   No
3  3      1 NA <NA>   No <NA> <NA> <NA>
4  4      2 NA <NA> <NA>  Yes <NA> <NA>
5  5      2 NA  Yes <NA> <NA> <NA> <NA>
6  6      2 NA <NA> <NA> <NA> <NA>   No
7  7      3 NA <NA> <NA>   No <NA> <NA>
8  8      3 NA <NA> <NA> <NA>   No <NA>
9  9      3 NA <NA> <NA> <NA> <NA>   No

Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.

I'd like to create a single row for each grp_id to get a data frame that looks like the following:

grp_id  a    b    c    d    e    f
     1 NA   No   No <NA> <NA>   No
     2 NA  Yes <NA>  Yes <NA>   No
     3 NA <NA> <NA>   No   No   No

I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be

d %>% 
   group_by(grp_id) %>%
     summarise()

I would appreciate help with the commands within summarise, or any solution really. Thanks.

akrun · Answer 1 · 2017-11-19T19:36:10.623

1

We can use summarise_at and subset the first non-NA element

library(dplyr)
d %>%
   group_by(grp_id) %>%
   summarise_at(2:7, funs(.[!is.na(.)][1]))
# A tibble: 3 x 7
#   grp_id     a      b      c      d      e      f
#    <dbl> <lgl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1      1    NA     No     No   <NA>   <NA>     No
#2      2    NA    Yes   <NA>    Yes   <NA>     No
#3      3    NA   <NA>   <NA>     No     No     No

In the example dataset, columns 'a' to 'f' are all factors with some having only 'No' levels. If it needs to be standardized with all the columns having the same levels, then we may need to call the factor with levels specified as c('Yes', 'No') in the summarise_at i.e. summarise_at(2:7, funs(factor(.[!is.na(.)][1], levels = c('Yes', 'No'))))

edited Nov 19 '17 at 19:36

answered Nov 19 '17 at 18:45

akrun

874,273
37
540
662

I'm getting the following error: Error in summarise_impl(.data, dots) : Evaluation error: object 'NA' not found. – user2230555 Nov 19 '17 at 18:49
@user2230555 As I am using your example, could it be a version issue. I use `dplyr_0.7.4` – akrun Nov 19 '17 at 18:50
Ah, that may be it. I'm using 0.7.1. I'll update and try again. Thanks. – user2230555 Nov 19 '17 at 18:51
Hhmmm, even with dplyr_0.7.4 I'm still getting that error. – user2230555 Nov 19 '17 at 18:59
@user2230555 Are you using the same example or different one? Can u try `d %>% group_by(grp_id) %>% summarise_at(2:7, funs(if(all(is.na(.))) NA else .[!is.na(.)][1]))` – akrun Nov 19 '17 at 19:00
It works nicely with the example dataset (d), but not with my actual dataset. I'm trying to figure out what's different between them. – user2230555 Nov 19 '17 at 19:03
@user2230555 That was what I was trying to understand. Can u try `summarise_at(2:7, funs(.[order(is.na(.))][1]))` – akrun Nov 19 '17 at 19:04
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/159345/discussion-between-user2230555-and-akrun). – user2230555 Nov 19 '17 at 19:05
@G.Grothendieck It can be solved easily with `factor` call. Also, in the OP' example starts with inconsistent factor levels. So I don't know which one the OP prefers – akrun Nov 19 '17 at 19:28
would you have a similar data.table way of doing ? I was trying to figure out a similar and simple way as in your answer but I didn't find. – denis Nov 19 '17 at 21:30
1

@denis `setDT(d)[, lapply(.SD, function(x) x[!is.na(x)][1]), by = grp_id, .SDcols = letters[1:6])` – akrun Nov 20 '17 at 03:05
1

@akrun Thanks! I think I never used the lapply this way, I learned something. – denis Nov 20 '17 at 09:13

G. Grothendieck · Answer 2 · 2017-11-20T13:41:25.027

We can use aggregate. No packages are used.

 YN <- function(x) c(na.omit(as.character(x)), NA)[1]
 aggregate(d[3:8], d["grp_id"], YN)

giving:

##   grp_id    a    b    c    d    e  f
## 1      1 <NA>   No   No <NA> <NA> No
## 2      2 <NA>  Yes <NA>  Yes <NA> No
## 3      3 <NA> <NA> <NA>   No   No No

The above gives character columns. If you prefer factor columns then use this:

YNfac <- function(x) factor(YN(x), c("No", "Yes"))
aggregate(d[3:8], d["grp_id"], YNfac)

Note: Other alternate implementations of YN are:

YN <- function(x) sort(as.character(x), na.last = TRUE)[1]

YN <- function(x) if (all(is.na(x))) NA_character_ else na.omit(as.character(x))[1]

library(zoo)
YN <- function(x) na.locf0(as.character(x), fromLast = TRUE)[1]

Curt F. · Answer 3 · 2017-11-21T17:00:05.173

You've received some good answers but neither of them actually uses the tidyr package. (The summarize() and summarize_at() family of functions is from dplyr.)

In fact, a tidyr-only solution for your problem is very doable.

d %>%
    gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
    na.omit() %>%
    select(-id) %>%
    spread(col, value, fill=NA, drop=FALSE)

The only hard part is ensuring that you get the a column in your output. For your example data, it is entirely NA. The trick is the factor_key=TRUE argument to gather() and the drop=FALSE argument to spread(). Without those two arguments being set, the output would not have an a column, and would only have columns with at least one non-NA entry.

Here's a description of how it works:

gather(col, value, -id, -grp_id, factor_key=TRUE) %>%

This tidies your data -- it effectively replaces columns a - f with new columns col and value, forming a long-formated "tidy" data frame. The entries in the col column are letters a - f. And because we've used factor_key=TRUE, this column is a factor with levels, not just a character vector.

na.omit() %>%

This removes all the NA values from the long data.

select(-id) %>%

This eliminates the id column.

spread(col, value, fill=NA, drop=FALSE)

This re-widens the data, using the values in the col column to define new column names, and the values in the value column to fill in the entries of the new columns. When data is missing, a value of fill (here NA) is used instead. And the drop=FALSE means that when col is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with setting col to be a factor, is what gets a as an output column.

I personally find this approach more readable than the approaches requiring subsetting or lapply stuff. Additionally, this approach will fail if your data is not actually one-hot, whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columns a - f are not factors, but character vectors. If you need factor output you should be able to do (untested)

mutate(value = factor(value, levels=c('Yes', 'No', NA))) %>%

anywhere between the gather() and spread() functions to ensure factor output.

Actually, because I used `select` I guess this answer is not `tidyr`-only. Apologies. But I still think this approach is worth considering. If your data is not truly one-hot, this approach will have different failure modes than the `summarize_at` approach. — Curt F., Nov 20 '17 at 23:36
Wow. I don't have this dataset open at the moment but I really appreciate the didactic and verbose explanation of what each command is doing! — user2230555, Nov 21 '17 at 03:02
I'm not sure if you are sarcastic or not, but being verbose and didactic was not my intention. — Curt F., Nov 21 '17 at 04:05
I am not being sarcastic at all! That was a compliment. I'm trying to learn the tidyr commands and this was really helpful teaching. — user2230555, Nov 21 '17 at 14:42
Ok, I've had some time to work through this now. I agree that this approach is the most readable solution. Also, thanks for introducing me to the term one-hot. — user2230555, Nov 21 '17 at 19:45

R: Cleaning up a wide and untidy dataframe

3 Answers3

Linked