
I am new to R and am trying to collapse rows based on row values with dplyr. The following example shows the sample data.

set.seed(123)

df <- data.frame(A = rep(1:4, 4),
                 B = runif(16, min = 0, max = 1),
                 C = rnorm(16, mean = 1, sd = 0.5))

   A          B         C
1  1 0.36647435 0.7485365
2  2 0.51864614 0.8654337
3  3 0.04596929 0.9858012
4  4 0.15479619 1.1294208
5  1 0.76712372 1.2460700
6  2 0.17666676 0.7402996
7  3 0.89759874 1.2699954
8  4 0.90267735 0.7101804
9  1 0.91744223 0.3451281
10 2 0.25472599 0.8604743
11 3 0.10933985 0.8696796
12 4 0.71656017 1.2648846
13 1 0.21157810 1.3170205
14 2 0.14947268 1.2789700
15 3 0.92251060 1.5696901
16 4 0.30090579 1.7642853

I want to collapse each pair of rows where column A has the values 1 and 2 into a single row containing their mean (for example, rows 1 and 2 become one row). The final result will therefore have only 12 rows, because the four 1/2 pairs have each been collapsed into a single row.

I tried the following dplyr code, but to little avail.

install.packages("tidyverse")
library(tidyverse)

df %>% summarize_each( fun(i){ for i %in% c(1,2)funs(mean) })

The expected output is something like:

   A    B           C
1  1.5  0.4425602   0.8069851
3  3    0.04596929  0.9858012
4  4    0.15479619  1.1294208
5  1.5  0.4718952   0.9931848
7  3    0.89759874  1.2699954
8  4    0.90267735  0.7101804
9  1.5  0.5860841   0.6028012
11 3    0.10933985  0.8696796
12 4    0.71656017  1.2648846
13 1.5  0.1805254   1.297995
15 3    0.92251060  1.5696901
16 4    0.30090579  1.7642853

Thank you in advance.

G1124E
  • The rows are not completely collapsed; they're collapsed two by two, and you don't explain the rule. In your real data, do you really have sequences of 4 rows one after another like this? – moodymudskipper Oct 28 '17 at 22:26
  • @Moody_Mudskipper the rows are to be collapsed if the values in column A are 1 and 2. – G1124E Oct 28 '17 at 22:28
  • When generating random data, it would be helpful to include a `set.seed()` call in your example. That way others can replicate your results exactly! – Mikko Marttila Oct 29 '17 at 08:29
  • I had read this, but it seems to work in groups of 4, which is the assumption the chosen answer took but which is not explicit in your question – moodymudskipper Oct 29 '17 at 19:17

2 Answers


By making the implicit, order-based groupings explicit, the summary can be done with a single summarise_all() call.

# Generate the data
set.seed(1)

df <- data.frame(
  A = c(rep(1:4, 4)), 
  B = runif(16, min = 0, max = 1), 
  C = rnorm(16, mean = 1, sd = 0.5)
)

library(dplyr)

new <- df %>%
  group_by(grp = rep(
    1:4,      # vector containing names of groups to create
    each = 4  # number of elements in each group
  )) %>% 
  group_by(mean_grp = cumsum(A > 2) + 1, add = TRUE) %>%  # A = 1 and 2 share mean_grp = 1
  summarise_all(mean) %>%
  ungroup()

new
#> # A tibble: 12 x 5
#>      grp mean_grp     A         B           C
#>    <int>    <dbl> <dbl>     <dbl>       <dbl>
#>  1     1        1   1.5 0.3188163 1.067598241
#>  2     1        2   3.0 0.5728534 1.755890584
#>  3     1        3   4.0 0.9082078 1.194921618
#>  4     2        1   1.5 0.5500358 0.291014883
#>  5     2        2   3.0 0.9446753 1.562465459
#>  6     2        3   4.0 0.6607978 0.977533195
#>  7     3        1   1.5 0.3454502 1.231911487
#>  8     3        2   3.0 0.2059746 1.410610598
#>  9     3        3   4.0 0.1765568 1.296950661
#> 10     4        1   1.5 0.5355633 1.425278418
#> 11     4        2   3.0 0.7698414 1.037282492
#> 12     4        3   4.0 0.4976992 0.005324152

I would recommend keeping the grouping variables in your data after the summary (everything is simpler if you include them in the first place), but if you want to, you can drop them with new %>% select(-grp, -mean_grp).


PS. In order to avoid having "magic numbers" (such as the 1:4 and each = 4 when creating grp) included in the code, you could also create the first grouping variable as:

grp = cumsum(A < lag(A, default = A[1])) + 1

This assumes that the original data are ordered such that a new group starts each time the value of A drops below the previous value of A.
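
For reference, here is a sketch of the full pipeline with that lag-based grouping in place of the hard-coded rep(1:4, each = 4). This is just an illustration under the ordering assumption above, reusing df and library(dplyr) from earlier, not a separate method:

df %>%
  group_by(grp = cumsum(A < lag(A, default = A[1])) + 1) %>%  # new group whenever A drops
  group_by(mean_grp = cumsum(A > 2) + 1, add = TRUE) %>%      # A = 1 and 2 share a group
  summarise_all(mean) %>%
  ungroup()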

Mikko Marttila
  • I tried to run the code, but I got this error message: `Error in mutate_impl(.data, dots) : wrong result size (64), expected 16 or 1`. I am not sure why we need the `each = 4` argument in the first `group_by` call. I also tried running the code without the `each = 4` argument, but the output was not ordered as expected. Can you please shed some light on that? – G1124E Oct 29 '17 at 23:16
  • Ah that's a typo: must have left that in while doing some styling. You can omit the first `4`, but you need the `each` to group together the 4 adjacent rows. I've edited accordingly. – Mikko Marttila Oct 30 '17 at 05:05

One option would be to process the rows with A equal to 1 or 2 separately from the other rows and then bind them back together:

library(dplyr)

set.seed(3)
df <- data.frame(A = rep(1:4, 4), B = runif(16, min = 0, max = 1), c = rnorm(16, mean = 1, sd = 0.5))

df %>% 
  filter(A %in% 1:2) %>% 
  group_by(tmp=cumsum(A==1)) %>% 
  summarise_all(mean) %>% 
  ungroup() %>% select(-tmp) %>% 
  bind_rows(df %>% filter(!A %in% 1:2))
       A         B         c
   <dbl>     <dbl>     <dbl>
 1   1.5 0.4877790 1.0121278
 2   1.5 0.6032474 0.8840735
 3   1.5 0.6042946 0.5996850
 4   1.5 0.5456424 0.6198039
 5   3.0 0.3849424 0.6276092
 6   4.0 0.3277343 0.4343907
 7   3.0 0.1246334 1.0760229
 8   4.0 0.2946009 0.8461718
 9   3.0 0.5120159 1.6121568
10   4.0 0.5050239 1.0999058
11   3.0 0.8679195 0.8981359
12   4.0 0.8297087 0.1667626
eipi10
  • Thank you @eipi10. I was wondering how you can preserve the order of rows, so that the rows are arranged in increasing order like: 1.5, 3, 4, 1.5, 3, 4, 1.5, 3, 4, 1.5, 3, 4. – G1124E Oct 29 '17 at 04:22
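
One possible way to keep that order (a sketch only, not from the original answer, and assuming the data really do come in blocks of four consecutive rows; the `block` helper column is made up for this illustration) is to tag each block before splitting, then arrange on it after binding the rows back together:

library(dplyr)

df2 <- df %>% mutate(block = rep(seq_len(n() / 4), each = 4))  # assumed blocks of 4 rows

df2 %>%
  filter(A %in% 1:2) %>%
  group_by(block) %>%
  summarise_all(mean) %>%
  bind_rows(df2 %>% filter(!A %in% 1:2)) %>%
  arrange(block, A) %>%  # restores the 1.5, 3, 4, 1.5, 3, 4, ... order
  select(-block)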