Efficient sampling from nested lists

Question

I have a list of lists, containing data.frames, from which I want to select only a few rows. I can achieve it in a for-loop, where I create a sequence based on the amount of rows and select only row indices according to that sequence.

But if I have deeper nested lists it doesn't work anymore. I am also sure, that there is a better way of doing that without a loop.

What would be an efficient and generic approach to sample from nested lists, that vary in their dimensions and contain data.frames or matrices?

## Dummy Data
n1=100;n2=300;n3=100
crdOrig <- list(
  list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

## Code to opimize
FiltRef <- list()
filterBy = 10
for (r in 1:length(crdOrig)) { 
  tmp <- do.call(rbind, crdOrig[[r]])
  filterInd <- seq(1,nrow(tmp), by = filterBy)
  FiltRef[[r]] <- tmp[filterInd,]
}
crdResult <- do.call(rbind, FiltRef)

# Plotting
crdOrigPl <- do.call(rbind, unlist(crdOrig, recursive = F))
plot(crdOrigPl[,1], crdOrigPl[,2], col="red", pch=20)
points(crdResult[,1], crdResult[,2], col="green", pch=20)

The code above works also if a list contains several data.frames (data below).

## Dummy Data (Multiple DF)
crdOrig <- list(
  list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)),
       data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

But if a list contains multiple lists, it throws an error trying to bind the result (FiltRef) together.

The result can be a data.frame with 2 columns (x,y) - like crdResult or a one dimensional list like FiltRef (from the first example)

## Dummy Data (Multiple Lists)
crdOrig <- list(
  list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
       list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

+1 and Thank you all for your brilliant answers! They all work and there is a lot to learn from each one of them. I will give this one to @Gwang-Jin Kim as his solution is the most flexible and extensive, although they all deserve to be checked!

Do you want an equal length list of lists object? So for last example, the first nested lists should be rbinded together? — Parfait, Jun 03 '18 at 17:27
I added the expected result, but yes it could also be a one dimensional list and the first nested list can be rbinded. — SeGa, Jun 03 '18 at 17:30

Roman Luštrik · Answer 1 · 2018-06-03T17:41:05.090

4

I would just flatten the whole darn thing and work on a clean list.

library(rlist)
out <- list.flatten(y)

# prepare a vector for which columns belong together
vc <- rep(1:(length(out)/2), each = 2)
vc <- split(1:length(vc), vc)

# prepare the final list
ll <- vector("list", length(unique(vc)))
for (i in 1:length(vc)) {
  ll[[i]] <- as.data.frame(out[vc[[i]]])
}

result <- lapply(ll, FUN = function(x) {
  x[sample(1:nrow(x), size = 10, replace = FALSE), ]
})

do.call(rbind, result)

           x        y
98  10.32912 52.87113
52  16.42912 46.07026
92  18.85397 46.26403
90  12.04884 57.79290
23  18.20997 40.57904
27  18.98340 52.55919
...

edited Jun 03 '18 at 17:41

answered Jun 03 '18 at 17:11

Roman Luštrik

69,533
24
154
197

If there are not just data.frames but matrices (same dimensions) at hand `do.call(rbind, result)` doesnt work, as there are unequal columns. – SeGa Jun 03 '18 at 18:27
2

@SeGa if that's the case, then you should probably get a cleaner dataset. :) – Roman Luštrik Jun 03 '18 at 18:38

Gwang-Jin Kim · Accepted Answer · 2018-06-04T18:28:21.637

Preparation and implementation of flatten

Well, there are many other answers which are in principle the same.

I meanwhile implemented for fun the flattening of nested lists.

Since I am thinking in Lisp:

Implemented first car and cdr from lisp.

car <- function(l) {
  if(is.list(l)) {
    if (null(l)) {
      list()
    } else {
      l[[1]]
    }
  } else {
    error("Not a list.")
  }
}

cdr <- function(l) {
  if (is.list(l)) {
    if (null(l) || length(l) == 1) {
      list()
    } else {
      l[2:length(l)]
    }
  } else {
    error("Not a list.")
  }
}

Some predicate functions:

null <- function(l) length(l) == 0   
# this is Lisp's `null` checking whether list is empty (`length(l) == 0`)
# R's `is.null()` checks for the value NULL and not `length(obj) == 0`

# upon @Martin Morgan's comment removed other predicate functions
# thank you @Martin Morgan!
# instead using `is.data.frame()` and `is.list()`, since they are
# not only already there but also safer.

Which are necessary to build flatten (for data frame lists)

flatten <- function(nested.list.construct) {
  # Implemented Lisp's flatten tail call recursively. (`..flatten()`)
  # Instead of (atom l) (is.df l).
  ..flatten <- function(l, acc.l) { 
    if (null(l)) {
      acc.l
    } else if (is.data.frame(l)) {   # originally one checks here for is.atom(l)
      acc.l[[length(acc.l) + 1]] <- l
      acc.l # kind of (list* l acc.l)
    } else {
      ..flatten(car(l), ..flatten(cdr(l), acc.l))
    }
  }
  ..flatten(nested.list.construct, list())
}

# an atom is in the widest sence a non-list object

After this, the actual function is defined using a sampling function.

Defining sampling function

# helper function
nrow <- function(df) dim(df)[1L]

# sampling function
sample.one.nth.of.rows <- function(df, fraction = 1/10) {
  # Randomly selects a fraction of the rows of a data frame
  nr <- nrow(df) 
  df[sample(nr, fraction * nr), , drop = FALSE]
}

The actual collector function (from nested data-frame-lists)

collect.df.samples <- function(df.list.construct, fraction = 1/10) {
  do.call(rbind, 
         lapply(flatten(df.list.construct), 
                function(df) sample.one.nth.of.rows(df, fraction)
               )
        )
}
# thanks for the improvement with `do.call(rbind, [list])` @Ryan!
# and the hint that `require(data.table)`
# `data.table::rbindlist([list])` would be even faster.

collect.df.samples first flattens the nested list construct of data frames df.list.construct to a flat list of data frames. It applies the function sample.one.nth.of.rows to each elements of the list (lapply). There by it produces a list of sampled data frames (which contain the fraction - here 1/10th of the original data frame rows). These sampled data frames are rbinded across the list. The resulting data frame is returned. It consists of the sampled rows of each of the data frames.

Testing on example

## Dummy Data (Multiple Lists)
n1=100;n2=300;n3=100
crdOrig <- list(
  list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
       list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

collect.df.samples(crdOrig, fraction = 1/10)

Refactoring for later modifications

By writing the collect.df.samples function to:

# sampler function
sample.10th.fraction <- function(df) sample.one.nth.of.rows(df, fraction = 1/10)

# refactored:
collect.df.samples <- 
  function(df.list.construct, 
           df.sampler.fun = sample.10th.fraction) {
  do.call(rbind, 
          lapply(flatten(df.list.construct), df.sampler.fun))
}

One can make the sampler function replace-able. (And if not: By changing the fraction parameter, one can enhance or reduce amount of rows collected from each data frame.)

The sampler function is in this definition easily exchangable

For choosing every nth (e.g. every 10th) row in the data frame, instead of a random sampling, you could e.g. use the sampler function:

df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]

and input it as df.sampler.fun = in collect.df.samples. Then, this function will be applied to every data frame in the nested df list object and collected to one data frame.

every.10th.rows <- function(df, nth = 10) {
  df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]
}

a.10th.of.all.rows <- function(df, fraction = 1/10) {
  sample.one.nth.of.rows(df, fraction)
}

collect.df.samples(crdOrig, a.10th.of.all.rows)
collect.df.samples(crdOrig, every.10th.rows)

Fun; FWIW R's S3 class system means that `class(x)` can return a vector of length > 1, so that `if (class(x) == "foo")` does not make sense; better to stick with `inherits()` or one of the predefined `is....()`. — Martin Morgan, Jun 03 '18 at 19:46
Thank you @Ryan. Ah, one is always learning - `do.call(rbind, [list])` then. Didn't knew that it is faster. And great, then I should use `rbindlist()` - Thank you for this nice hint! I actually wrote before `nrow(df)` :D. Thank you for your nice feedback! I will modify then! :) — Gwang-Jin Kim, Jun 03 '18 at 21:37
Thanks for that awesome answer!! Juts a small correction: I think a function is missing or the names are incorrect: `sample.one.nth.of.rows` is not found, I think it should be `sample.nth.of.rows`. — SeGa, Jun 04 '18 at 18:11
True, I changed them around :D - will correct. - I corrected! Welcome and thanks! It is if you have interactive R open, and then change names but the old definitions are still there ... you don't recognize certain mistakes. — Gwang-Jin Kim, Jun 04 '18 at 18:27

score 3 · Answer 3 · answered Jun 03 '18 at 17:41

Here's an answer in base borrowing from a custom "rapply" function mentioned here rapply to nested list of data frames in R

df_samples<-list()
i=1

f<-function(x) {
  i<<-i+1
  df_samples[[i]]<<-x[sample(rownames(x),10),]
}

recurse <- function (L, f) {
  if (inherits(L, "data.frame")) {
  f(L)  }
  else lapply(L, recurse, f)
}

recurse(crdOrig, f)

res<-do.call("rbind", df_samples)

Martin Morgan · Answer 4 · 2018-06-03T18:25:01.073

2

I too would flatten the list-of-lists into a standard representation (and do all analysis on the flattened representation, not just the subseting), but keep track of relevant indexing information, e.g.,

flatten_recursive = function(x) {
    i <- 0L
    .f = function(x, depth) {
        if (is.data.frame(x)) {
            i <<- i + 1L
            cbind(i, depth, x)
        } else {
            x = lapply(x, .f, depth + 1L)
            do.call(rbind, x)
        }
    }
    .f(x, 0L)
}

The internal function .f() visits each element of a list. If the element is a data.frame, it adds a unique identifier to index it. If it's a list, then it calls itself on each element of the list (incrementing a depth counter, in case this is useful, one could also add a 'group' counter) and then row-binds the elements. I use an internal function so that I can have a variable i to increment across function calls. The end result is a single data frame with a index to use for referencing the original results.

> tbl <- flatten_recursive(crdOrig) %>% as_tibble()
> tbl %>% group_by(i, depth) %>% summarize(n())
# A tibble: 4 x 3
# Groups:   i [?]
      i depth `n()`
  <int> <int> <int>
1     1     3   100
2     2     3   100
3     3     2   300
4     4     2   100
> tbl %>% group_by(i) %>% slice(seq(1, n(), by = 10)) %>% summarize(n())
# A tibble: 4 x 2
      i `n()`
  <int> <int>
1     1    10
2     2    10
3     3    30
4     4    10

The overall pattern of .f() can be adjusted for additional data types, e.g., (some details omitted)

.f <- function(x) {
    if (is.data.frame(x)) {
        x
    } else if (is.matrix(x)) {
        x <- as.data.frame(x)
        setNames(x, c("x", "y"))
    } else {
        do.call(rbind, lapply(x, .f))
    }
}

edited Jun 03 '18 at 18:25

answered Jun 03 '18 at 17:42

Martin Morgan

45,935
7
84
112

This seems to work. But is the `<<-` really necessary? It seems to work with normal `<-` too – SeGa Jun 03 '18 at 17:50
2

If you look at `i` when assigned with `<-` you'll see that all indexes are `1`, rather than 1..4 -- `<-` modifies the value of `i` locally, so each call to `.f()` increments `i` from 0 to 1... – Martin Morgan Jun 03 '18 at 17:52
Indeed! Thank you. In my case i dont need to keep track of the data origin, but its a nice feature! – SeGa Jun 03 '18 at 17:55
1

@SeGa without the `i` this method has no way of sampling each data.frame individually, so that's why it's needed. – IceCreamToucan Jun 03 '18 at 18:04
With the real data I'm getting a `Error: C stack usage`. Why is that? – SeGa Jun 03 '18 at 18:12
How nested is your real data? – Martin Morgan Jun 03 '18 at 18:14
Its a list of 6 lists, each a List of 1, but i just saw, that there are matrices aswell (like: `..$ : num [1:200, 1:2]`), not just data.frames! – SeGa Jun 03 '18 at 18:17
1

I suggested an extension to `.f()`, but you'll need to incorporate `i` and `depth`; should be easy enough and a good exercise 'for the student'! – Martin Morgan Jun 03 '18 at 18:25

Parfait · Answer 5 · 2018-06-03T20:08:27.930

2

Consider a recursive call conditionally checking if first item is a data.frame or list class.

stack_process <- function(lst){
  if(class(lst[[1]]) == "data.frame") {
    tmp <- lst[[1]]
  } 

  if(class(lst[[1]]) == "list") {
    inner <- lapply(lst, stack_process)        
    tmp <- do.call(rbind, inner)
  }

  return(tmp)
}

new_crdOrig <- lapply(crdOrig, function(x) {
  df <- stack_process(x)

  filterInd <- seq(1, nrow(df), by = filterBy)
  return(df[filterInd,])
})

final_df <- do.call(rbind, new_crdOrig)

edited Jun 03 '18 at 20:08

answered Jun 03 '18 at 17:52

Parfait

104,375
17
94
125

Is there a reason not to prefer the simpler `inner <- lapply(lst, stack_process)`, which doesn't require pre-allocation of `inner` and implicit re-allocation in the loop? – Martin Morgan Jun 03 '18 at 18:04
Indeed, @MartinMorgan, good point! I got caught in the weeds and couldn't see it. – Parfait Jun 03 '18 at 20:06

Efficient sampling from nested lists

5 Answers5

Linked