Aggregate a huge data frame: Sum of every five columns

Question

I am working with a huge data frame without column names that I need to aggregate into a specific format. The original data frame has 4915 observations of 1140 variables. I need to aggregate the dataset to a sum of every five variables in the set and drop the initial columns.

I created a small sample dataset that looks similar to my data

v1.x <- c("AFG", "ALB", "DZA", "AND", "AGO", "ATG", "ARG", "ARM", "ABW", "AUS", "AUT", "AZE", "BHS", "BHR", "BGD", "BRB", "BLR", "BEL", "BLZ", "BEN")
v2.x <- c("Agriculture","Fishing","Mining and Quarrying","Food & Beverages","Textiles and Wearing Apparel","Wood and Paper",
              "Petroleum, Chemical and Non-Metallic Mineral Products", "Metal Products", "Electrical and Machinery", "Transport Equipment",
              "Other Manufacturing","Recycling" ,"Electricity, Gas and Water" ,"Construction" ,"Maintenance and Repair" ,"Wholesale Trade", 
              "Retail Trade", "Hotels and Restraurants", "Transport","Post and Telecommunications")
v1.y <- c(1:20)
v2.y <- c(12:31)
v3 <- c(5:24)
v4 <- c(2:21)
v5 <- c(1:20)
v6 <- c(7:26)
v7 <- c(5:24)
v8 <- c(1:20)
v9 <- c(2:21)
v10 <- c(4:23)
v11 <- c(2:21)
v12 <- c(3:22)
v13 <- c(6:25)
v14 <- c(3:22)
v15 <- c(8:27)

df <- data.frame(v1.x, v2.x, v1.y, v2.y, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15)

I retrieved the results for the first batch of rows by applying rowSum()

df1 <- df %>%
  mutate(row_sum=rowSums(
    select(., "v1.y", "v2.y", "v3", "v4", "v5"))) %>%
  mutate(row_sum2=rowSums(
    select(., "v6", "v7", "v8", "v8", "v10"))) %>%
  mutate(row_sum3=rowSums(
    select(., "v11", "v12", "v13", "v14", "v15"))) %>%
  select(v1.x, v2.x, "row_sum",  "row_sum2", "row_sum3")

the results

 v1.x                         v2.x row_sum row_sum2 row_sum3
1  AFG                  Agriculture      21       17       22
2  ALB                      Fishing      26       21       27
3  DZA         Mining and Quarrying      31       25       32
4  AND             Food & Beverages      36       29       37
5  AGO Textiles and Wearing Apparel      41       33       42
6  ATG               Wood and Paper      46       37       47```

However, because the original data frame contains more than 1000 variables it would be very cumbersum to apply mutate() or rowmsum() for every batch of variables.

score 0 · Accepted Answer · answered Feb 03 '23 at 17:55

We could use split.default to split up every 5 numeric columns and get the rowwise sum with rowSums and bind with the original data

library(dplyr)
library(purrr)
library(stringr)
df %>%
   select(where(is.numeric)) %>% 
   split.default(as.integer(gl(ncol(.), 5, ncol(.)))) %>% 
   map_dfc(rowSums, na.rm = TRUE) %>% 
   rename_with(~ str_c("row_sum", .x)) %>%
   bind_cols(df %>% 
       select(where(negate(is.numeric))), .)

-output

  v1.x                                                  v2.x row_sum1 row_sum2 row_sum3
1   AFG                                           Agriculture       21       19       22
2   ALB                                               Fishing       26       24       27
3   DZA                                  Mining and Quarrying       31       29       32
4   AND                                      Food & Beverages       36       34       37
5   AGO                          Textiles and Wearing Apparel       41       39       42
6   ATG                                        Wood and Paper       46       44       47
7   ARG Petroleum, Chemical and Non-Metallic Mineral Products       51       49       52
8   ARM                                        Metal Products       56       54       57
9   ABW                              Electrical and Machinery       61       59       62
10  AUS                                   Transport Equipment       66       64       67
11  AUT                                   Other Manufacturing       71       69       72
12  AZE                                             Recycling       76       74       77
13  BHS                            Electricity, Gas and Water       81       79       82
14  BHR                                          Construction       86       84       87
15  BGD                                Maintenance and Repair       91       89       92
16  BRB                                       Wholesale Trade       96       94       97
17  BLR                                          Retail Trade      101       99      102
18  BEL                               Hotels and Restraurants      106      104      107
19  BLZ                                             Transport      111      109      112
20  BEN                           Post and Telecommunications      116      114      117

Aggregate a huge data frame: Sum of every five columns

1 Answers1