df %>% split(.$x)

becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then call split on each subset, we reduce the time by at least an order of magnitude.

library(dplyr)
library(microbenchmark)
library(caret)
library(purrr)

N      <- 10^6
groups <- 10^5
df     <- data.frame(x = sample(1:groups, N, replace = TRUE), 
                     y = sample(letters,  N, replace = TRUE))
ids      <- df$x %>% unique
folds10  <- createFolds(ids, 10)
folds100 <- createFolds(ids, 100)

Running microbenchmark gives us

## Unit: seconds

## expr                                                  mean
l1 <- df %>% split(.$x)                                # 242.11805

l2 <- lapply(folds10,  function(id) df %>% 
      filter(x %in% id) %>% split(.$x)) %>% flatten    # 50.45156  

l3 <- lapply(folds100, function(id) df %>% 
      filter(x %in% id) %>% split(.$x)) %>% flatten    # 12.83866  

Is split not designed for large groups? Are there any alternatives besides the manual initial subsetting?

My laptop is a MacBook Pro (late 2013), 2.4 GHz, 8 GB RAM.

Rickard
  • I want to process the resulting list items in parallel, i.e. `list_of_dataframes %>% map(sequentially_process_each_row_of_df)` – Rickard Sep 17 '16 at 09:57
  • Consider, also, `order`ing `df` before `split`ting, so that `.Internal(split())` accesses memory more consecutively -- `system.time({ a = split(df, df$x) }); system.time({ odf = df[order(df$x), ]; b = split(odf, odf$x) }); identical(a, b)` – alexis_laz Sep 17 '16 at 16:30
  • @alexis_laz actually, ordering creates row names, rather than improving memory access patterns -- compare `.row_names_info(df)` and `.row_names_info(df[order(df$x),])`; the negative value in the first case indicates that the row names are stored compactly as `c(NA, 1000000)`, the positive value in the second case that they are stored literally as an integer vector. – Martin Morgan Sep 18 '16 at 05:41
  • @MartinMorgan : You're right - I totally missed that, thanks. Setting `row.names() = NULL` bumps up execution time significantly. Besides, I guess, that -since each `df$x` contains a small amount of elements- populating the indices for each group successively (in the internal splitting) should not make that difference. – alexis_laz Sep 19 '16 at 10:41
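A minimal sketch of the check described in the comments above, assuming the df defined in the question (.row_names_info() with its default type reports a negative count while row names are still stored compactly, and a positive count once they have been materialized):

.row_names_info(df)                 # negative: compact row names c(NA, n)
.row_names_info(df[order(df$x), ])  # positive: literal integer row names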

3 Answers


More an explanation than an answer: sub-setting a large data.frame is more costly than sub-setting a small data.frame.

> df100 = df[1:100,]
> idx = c(1, 10, 20)
> microbenchmark(df[idx,], df100[idx,], times=10)
Unit: microseconds
         expr     min      lq     mean  median      uq     max neval
    df[idx, ] 428.921 441.217 445.3281 442.893 448.022 475.364    10
 df100[idx, ]  32.082  32.307  35.2815  34.935  37.107  42.199    10

split() pays this cost for each group.

The reason can be seen by running Rprof()

> Rprof(); for (i in 1:1000) df[idx,]; Rprof(NULL); summaryRprof()
$by.self
       self.time self.pct total.time total.pct
"attr"      1.26      100       1.26       100

$by.total
               total.time total.pct self.time self.pct
"attr"               1.26       100      1.26      100
"[.data.frame"       1.26       100      0.00        0
"["                  1.26       100      0.00        0

$sample.interval
[1] 0.02

$sampling.time
[1] 1.26

All of the time is being spent in a call to attr(). Stepping through the code using debug("[.data.frame") shows that the pain involves a call like

attr(df, "row.names")

This small example shows a trick that R uses to avoid representing row names that are not present: use c(NA, -5L), rather than 1:5.

> dput(data.frame(x=1:5))
structure(list(x = 1:5), .Names = "x", row.names = c(NA, -5L), class = "data.frame")

Note that attr() returns a vector -- the row.names are created on the fly, and for a large data.frame a large number of row.names are created

> attr(data.frame(x=1:5), "row.names")
[1] 1 2 3 4 5

So one might expect that even nonsensical row.names would speed the calculation

> dfns = df; rownames(dfns) = rev(seq_len(nrow(dfns)))
> system.time(split(dfns, dfns$x))
   user  system elapsed 
  4.048   0.000   4.048 
> system.time(split(df, df$x))
   user  system elapsed 
 87.772  16.312 104.100 

Splitting a vector or matrix would also be fast.
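A minimal sketch of that idea, assuming the df from the question: split only a vector of row indices (so the repeated data.frame subsetting is avoided), then materialize a group's rows only when it is actually needed.

idx <- split(seq_len(nrow(df)), df$x)   # fast: only an integer vector is split
first_group <- df[idx[[1]], ]           # subset a single group on demand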

Martin Morgan

This isn't strictly a split.data.frame issue; there is a more general problem with the scalability of data.frame for many groups.
You can get a pretty nice speed-up if you use split.data.table. I developed this method on top of regular data.table methods and it seems to scale pretty well here.

system.time(
    l1 <- df %>% split(.$x)   
)
#   user  system elapsed 
#200.936   0.000 217.496 
library(data.table)
dt = as.data.table(df)
system.time(
    l2 <- split(dt, by="x")   
)
#   user  system elapsed 
#  7.372   0.000   6.875 
system.time(
    l3 <- split(dt, by="x", sorted=TRUE)   
)
#   user  system elapsed 
#  9.068   0.000   8.200 

sorted=TRUE will return the list in the same order as the data.frame method; by default the data.table method preserves the order present in the input data. If you want to stick to data.frame you can use lapply(l2, setDF) at the end.
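A minimal sketch of that last step, assuming the l2 from above (setDF() converts a data.table back to a data.frame by reference and returns it, so lapply() yields a list of plain data.frames):

# convert each element of the list back to a plain data.frame
l2 <- lapply(l2, setDF)
class(l2[[1]])
# [1] "data.frame"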

PS. split.data.table was added in data.table 1.9.7; installation of the devel version is pretty simple:

install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")

More about that in the Installation wiki.

jangorecki
  • split.data.table was significantly faster. I ended up re-writing parts of my code using data.table. – Rickard Sep 22 '16 at 16:18

A very nice cheat exploiting group_split from dplyr 0.8.3 or above:

library(dplyr)  # group_split() requires dplyr >= 0.8.3

random_df <- tibble(colA = paste("A", 1:1200000, sep = "_"),
                    colB = as.character(paste("A", 1:1200000, sep = "_")),
                    colC = 1:1200000)

# base R split over 1.2 million groups: takes minutes
random_df_list <- split(random_df, random_df$colC)

# dplyr group_split over the same column: takes seconds
random_df_list <- random_df %>% group_split(colC)

This reduces an operation that takes a few minutes to a few seconds!
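A sketch of how that comparison could be timed on the example above (actual numbers depend on hardware and the dplyr version):

system.time(split(random_df, random_df$colC))   # base split: reported as minutes
system.time(random_df %>% group_split(colC))    # group_split: reported as seconds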