tidyr spread sorting inconsistencies

Question

Ive noticed tidyr(0.4.0) sorts value columns when using spread, where as tidyr(0.3.1) returns the value columns in the order they were prior to a gather.

Reproducible example 1:

library(dplyr)
library(tidyr)
# tidyr 0.3.1

dat<-data.frame(name=rep(c("A","B"),5),sam.id=rep(c(1,2),5),
      frac=sample(c(0.05,0.1,0.2),10,replace=TRUE),
      Aspecies=rnorm(10),Bspecies=rnorm(10),Zspecies=rnorm(10))

I aggregate the species values by both sam.id and frac(the proportion of the sample measured) i.e a multiple gather.

dt.agg.0.3.1 <- gather(dat,key,value,-name,-sam.id) %>% 
                group_by(name,key) %>% 
                summarise(Total=sum(value)) %>% spread(key,Total) %>%
                mutate(all=rowSums(.[,3:5]))

The last part of the pipe computes a simple total of all species using mutate. So that:

head(dt.agg.0.3.1)

Source: local data frame [2 x 6]

name  frac  Aspecies    Bspecies  Zspecies       all
(fctr) (dbl)     (dbl)       (dbl)     (dbl)     (dbl)
1      A  0.85 -2.675137 -0.03287276  1.016791 -1.858010
2      B  0.40  4.194904  1.50561762 -2.738543  6.100522

Reproducible example 2:

library(tidyr)
# 0.4.0

dt.agg.0.4.0 <- gather(dat,key,value,-name,-sam.id) %>%
                group_by(name,key) %>% 
                summarise(Total=sum(value)) %>% spread(key,Total)

head(dt.agg.0.4.0)

Source: local data frame [2 x 5]
Groups: name [2]

name  Aspecies    Bspecies  frac  Zspecies
(fctr)     (dbl)       (dbl) (dbl)     (dbl)
1      A -2.675137 -0.03287276  0.85  1.016791
2      B  4.194904  1.50561762  0.40 -2.738543

One can see how the ordering of the value columns has been changed (alphabetically), which makes additional data-pipeline steps using mutate for example troublesome.

dt.agg.0.4.0.mutated <- gather(dat,key,value,-name,-sam.id) %>%
                        group_by(name,key) %>% summarise(Total=sum(value)) %>%
                        spread(key,Total) %>% mutate(all=rowSums(.[,2:5]))

Throws the error;

Error: incompatible size (2), expecting 1 (the group size) or 1

Is there a way for tidyr(0.4.0) to spread back out in the order of the gather?

Or will have to gather (and summarise) twice -- once for each key-value pair?

akrun · Accepted Answer · 2016-01-30T17:17:34.557

We can use ungroup after the spread and grep the columns that needs to used for rowSums (tidyr_0.4.0).

gather(dat, key, value, -name, sam.id) %>% 
           group_by(name, key) %>% 
           summarise(Total=sum(value)) %>%
           spread(key, Total) %>%
           ungroup() %>%
           mutate(all= rowSums(.[grep('species', names(.))]))
#     name Aspecies   Bspecies  frac sam.id Zspecies      all
#   (fctr)    (dbl)      (dbl) (dbl)  (dbl)    (dbl)    (dbl)
#1      A 5.795958 -0.4769954   0.4      5 3.965114 9.284077
#2      B 2.475395 -1.4858969   0.5     10 1.045175 2.034674

If we need to get the order of the columns as that occurs in the 'key' column, then we may need to convert the 'key' to factor class with levels specified. In this case, we can use the position index in rowSums.

gather(dat,key,value,-name,-sam.id) %>% 
      mutate(key= factor(key, levels=unique(key))) %>% 
      group_by(name, key) %>%
      summarise(Total = sum(value)) %>% 
      spread(key, Total) %>% 
      ungroup() %>%
      mutate(all = rowSums(.[3:5]))
#Source: local data frame [2 x 6]

#    name  frac Aspecies   Bspecies Zspecies      all
#  (fctr) (dbl)    (dbl)      (dbl)    (dbl)    (dbl)
#1      A   0.4 5.795958 -0.4769954 3.965114 9.284077
#2      B   0.5 2.475395 -1.4858969 1.045175 2.034674

If we look at the str after the spread step i.e.

res <- gather(dat, key, value, -name, sam.id) %>% 
                   group_by(name, key) %>% 
                   summarise(Total=sum(value)) %>% 
                   spread(key, Total)

str(res)
#Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  6 variables:
#...

One of the class is 'grouped_df', which somehow creates the problem.

str(res %>% ungroup)  
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       2 obs. of  6 variables:

NOTE: The values are different from the OP's post as no set.seed was specified.

great! I'll use `mutate(key= factor(key, levels=unique(key)))` as the actual species names vary but are rather arranged in sequential columns... — danny_C_O_T_W, Jan 30 '16 at 18:15
@kleanBean The `unique(key)` will order the elements in the `key` as it appears in the column. If you need a custom order, then specify the levels as it is needed. i.e. `levels = c('frac', 'Aspecies', 'Somethingelse'...` Regarding the difference in both versions, this may be a bug.. — akrun, Jan 30 '16 at 18:17
Now there is a parameter for `gather()` `factor_key=T` which preserves the original order too. — puslet88, Aug 10 '18 at 09:06

tidyr spread sorting inconsistencies

1 Answers1