I am working with a huge data frame without column names that I need to aggregate into a specific format. The original data frame has 4915 observations of 1140 variables. I need to aggregate the dataset to a sum of every five variables in the set and drop the initial columns.
I created a small sample dataset that looks similar to my data
v1.x <- c("AFG", "ALB", "DZA", "AND", "AGO", "ATG", "ARG", "ARM", "ABW", "AUS", "AUT", "AZE", "BHS", "BHR", "BGD", "BRB", "BLR", "BEL", "BLZ", "BEN")
v2.x <- c("Agriculture","Fishing","Mining and Quarrying","Food & Beverages","Textiles and Wearing Apparel","Wood and Paper",
"Petroleum, Chemical and Non-Metallic Mineral Products", "Metal Products", "Electrical and Machinery", "Transport Equipment",
"Other Manufacturing","Recycling" ,"Electricity, Gas and Water" ,"Construction" ,"Maintenance and Repair" ,"Wholesale Trade",
"Retail Trade", "Hotels and Restraurants", "Transport","Post and Telecommunications")
v1.y <- c(1:20)
v2.y <- c(12:31)
v3 <- c(5:24)
v4 <- c(2:21)
v5 <- c(1:20)
v6 <- c(7:26)
v7 <- c(5:24)
v8 <- c(1:20)
v9 <- c(2:21)
v10 <- c(4:23)
v11 <- c(2:21)
v12 <- c(3:22)
v13 <- c(6:25)
v14 <- c(3:22)
v15 <- c(8:27)
df <- data.frame(v1.x, v2.x, v1.y, v2.y, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15)
I retrieved the results for the first batch of rows by applying rowSum()
df1 <- df %>%
mutate(row_sum=rowSums(
select(., "v1.y", "v2.y", "v3", "v4", "v5"))) %>%
mutate(row_sum2=rowSums(
select(., "v6", "v7", "v8", "v8", "v10"))) %>%
mutate(row_sum3=rowSums(
select(., "v11", "v12", "v13", "v14", "v15"))) %>%
select(v1.x, v2.x, "row_sum", "row_sum2", "row_sum3")
the results
v1.x v2.x row_sum row_sum2 row_sum3
1 AFG Agriculture 21 17 22
2 ALB Fishing 26 21 27
3 DZA Mining and Quarrying 31 25 32
4 AND Food & Beverages 36 29 37
5 AGO Textiles and Wearing Apparel 41 33 42
6 ATG Wood and Paper 46 37 47```
However, because the original data frame contains more than 1000 variables it would be very cumbersum to apply mutate() or rowmsum() for every batch of variables.