
I'd like to split my dataset by the variable `group` and then remove that variable from the resulting data frames. Right now I'm using a for loop, but I'm looking for something that avoids the loop, and ideally something in base R without loading dplyr or a similar package.

n <- 10
x <- runif(n)*10
y <- runif(n)*10
group <- rep(1:2, each=5)

my_data <- data.frame(group, x, y)
subset_data <- split(my_data, my_data$group, drop=TRUE)


drop_column <- "group"
for (i in seq_along(subset_data)) {
  subset_data[[i]] <- subset_data[[i]][,!(names(subset_data[[i]]) %in% drop_column)]
}

Thank you.

thewan

2 Answers


A base R option is to subset the data first (i.e., remove the grouping column) and then split the data frame by the original grouping column.

split(subset(my_data, select = -group), my_data$group)

However, if the grouping column is always in the first position, then you can just use the index, rather than `subset`, to remove the grouping column from the output.

split(my_data[-1], my_data$group) 

Output

$`1`
         x         y
1 3.421037 0.2846179
2 9.219159 5.0449367
3 4.157628 1.3970608
4 3.412703 2.2196774
5 9.948763 6.5528746

$`2`
           x         y
6  0.3746215 3.4387533
7  3.0722134 0.5371084
8  3.0580508 0.4649525
9  3.6308661 6.5796197
10 6.4435513 3.0641620

Another base R option is to use `subset` inside `lapply`, which lets you split and remove the grouping variable all in one step.

lapply(split(my_data, my_data$group, drop=TRUE), subset, select = -group)
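If you want to keep using the `drop_column` variable from the question (so the code still works if the grouping column is renamed or not in the first position), the same subset-then-split idea can be written with `setdiff` — a sketch using the question's data:

```r
# Build the example data from the question.
n <- 10
my_data <- data.frame(group = rep(1:2, each = 5),
                      x = runif(n) * 10,
                      y = runif(n) * 10)

drop_column <- "group"

# Keep only the columns NOT named in drop_column, then split by group.
# No post-processing loop is needed.
subset_data <- split(my_data[setdiff(names(my_data), drop_column)],
                     my_data$group)
```

Because `setdiff` works on column names, this does not depend on the position of the grouping column.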
AndrewGB
    You may also subset first and then `split` - `split(my_data[-1], my_data$group)` or if you have to use `drop_column` then `split(my_data[setdiff(names(my_data), drop_column)], my_data$group)` – Ronak Shah Jan 30 '22 at 09:01

You can use `group_split` from dplyr and set the `.keep` parameter to `FALSE`:

library(dplyr)
subset_data <- my_data |>
  group_split(group, .keep = FALSE)

<list_of<
  tbl_df<
    x: double
    y: double
  >
>[2]>
[[1]]
# A tibble: 5 x 2
      x     y
  <dbl> <dbl>
1  9.43  1.84
2  2.34  9.41
3  6.96  7.56
4  7.91  5.11
5  1.52  3.38

[[2]]
# A tibble: 5 x 2
       x     y
   <dbl> <dbl>
1 2.71   6.14 
2 0.959  8.13 
3 0.0337 0.315
4 1.26   8.30 
5 4.73   0.122
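Note that, unlike `split()`, `group_split()` returns an *unnamed* list. If you want the list elements named by their group value, one option (a sketch, assuming dplyr is loaded) is to combine it with `group_keys()`, which returns one row per group:

```r
library(dplyr)

# Example data as in the question.
my_data <- data.frame(group = rep(1:2, each = 5),
                      x = runif(10) * 10,
                      y = runif(10) * 10)

# group_split() drops the grouping column and the names;
# group_keys() recovers one row per group, whose values we
# reuse as list names via setNames().
subset_data <- my_data |>
  group_split(group, .keep = FALSE) |>
  setNames(group_keys(group_by(my_data, group))$group)
```

This gives the same `$`1`` / `$`2`` access pattern as the base R `split()` result.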
deschen
  • This is the route I'd usually go, but the OP specifically says that they want base R and not `dplyr`. – AndrewGB Jan 30 '22 at 07:51
  • Yes, I'm trying to avoid loading too many packages into my environment. – thewan Jan 30 '22 at 08:09
  • Sorry, missed that part. Although I usually don't buy/understand this "don't want to load many things" argument. What's the point of it? I mean, R lives through packages and their helpful extensions (not saying that base R wouldn't be able to solve the OP's problem). And I'd say dplyr (along with e.g. data.table) is one of the few key packages that don't hurt (others are free to disagree, of course). – deschen Jan 30 '22 at 08:38
  • Imagine you have to guarantee reproducibility and rebuild the working environment often (i.e. you're reviewing a PR with pinned package versions). Compiling many packages over and over again can be quite time consuming. Something similar can happen when you're packaging (pinned versions) into Docker containers. There's also a problem if packages are not stable and change API from one release to another (can be solved with pinning versions where renv package has made things quite easy). – Roman Luštrik Jan 30 '22 at 08:58
  • Good points actually. But as you said, you can create stable environments using "fixed" versions of packages. Although that's not always easy, true. – deschen Jan 30 '22 at 09:33
  • Right now I'm building a docker image that takes about 15-20 minutes to build due to all the R package dependencies. Each dependency has a small overhead cost, but these add up. Then there's system dependencies that might come along... – Roman Luštrik Feb 01 '22 at 09:27