How to structure R code for inherently nested problems to be easily readible?

Question

There are problems that inherently require several layers of nesting to be solved. In a current project, I frequently find myself using three nested applys in order to do something with the elements contained in the deepest layer of a nested list structure.

R's list handling and apply-family allows for quite concise code for this type of problems, but writing it still gives me headaches and I'm pretty sure that anyone else reading it will have to take several minutes to understand what I am doing. This is although I'm basically doing nothing complicated - if it weren't for the multiple layers of lists to be traversed.

Below I provide a snippet of code that I wrote that shall serve as an example. I would consider it concise, yet hard to read.

Context: I am writing a simulation of surface electromyographic data, i.e. changes in the electrical potential that can be measured on the human skin and which is invoked by muscular activity. For this, I consider several muscles (first layer of lists) which each consist of a number of so-called motor units (second layer of lists) which each are related to a set of electrodes (third layer of lists) placed on the skin. In the example code below, we have the objects .firing.contribs, which contains the information on how strong the action of a motor unit influences the potential at a specific electrode, and .firing.instants which contains the information on the instants in time at which these motor units are fired. The given function then calculates the time course of the potential at each electrode.

Here's the question: What are possible options to render this type of code easily understable for the reader? Particularly: How can I make clearer what I am actually doing, i.e. performing some computation on each (MU, electrode) pair and then summing up all potential contributions to each electrode? I feel this is currently not very obvious in the code.

Notes:

I know of plyr. So far, I have not been able to see how I can use this to render my code more readable, though.
I can't convert my structure of nested lists to a multidimensional array since the number of elements is not the same for each list element.
I searched for R implementations of the MapReduce algorithm, since this seems to be applicable to the problem I give as an example (although I'm not a MapReduce expert, so correct me if I'm wrong...). I only found packages for Big Data processing, which is not what I'm interested in. Anyway, my question is not about this particular (Map-Reducible) case but rather about a general programming pattern for inherently nested problems.
I am not interested in performance for the moment, just readability.

Here's the example code.

sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {

    calc.MU.contribs <- function(.MU.firing.contribs, .MU.firing.instants)
        lapply(.MU.firing.contribs, calc.MU.electrode.contrib,
               .MU.firing.instants)

    calc.muscle.contribs <- function(.MU.firing.contribs, .MU.firing.instants) {

        MU.contribs <- mapply(calc.MU.contribs, .MU.firing.contribs,
                              .MU.firing.instants, SIMPLIFY = FALSE)

        muscle.contribs <- reduce.contribs(MU.contribs)
    }

    muscle.contribs <- mapply(calc.muscle.contribs, .firing.contribs,
                              .firing.instants, SIMPLIFY = FALSE)

    surface.potentials <- reduce.contribs(muscle.contribs)
}

## Takes a list (one element per object) of lists (one element per electrode)
## that contain the time course of the contributions of that object to the
## surface potential at that electrode (as numerical vectors).
## Returns a list (one element per electrode) containing the time course of the
## contributions of this list of objects to the surface potential at all
## electrodes (as numerical vectors again).
reduce.contribs <- function(obj.list) {

    contribs.by.electrode <- lapply(seq_along(obj.list[[1]]), function(i)
                                    sapply(obj.list, `[[`, i))

    contribs <- lapply(contribs.by.electrode, rowSums)
}


calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {

    ## This will in reality be more complicated since then .MU.firing.contrib
    ## will have a different (more complicated) structure.
    .MU.firing.contrib * .MU.firing.instants
}

firing.contribs <- list(list(list(1,2),list(3,4)),
                         list(list(5,6),list(7,8),list(9,10)))

firing.instants <- list(list(c(0,0,1,0), c(0,1,0,0)),
                         list(c(0,0,0,0), c(1,0,1,0), c(0,1,1,0)))

surface.potentials <- sum.MU.firing.contributions(firing.contribs, firing.instants)

I know the question and the example are long, but I felt that this question required a 'real' example instead of a minimal one. Please let me know if I can rephrase the question or the example to make it more accessible. This is my first post, after all. — Eike P., Jul 04 '14 at 17:21
I 'm commenting on the part where you mention about three layers of list. Would, perhaps, be helpful for your work to have your layers in a long "data.frame" format? E.g. column1 = muscle, column2 = MU, column3 = electrode. Many R tools do complicated work with less code on such formats and many efficiency-oriented packages work on such formats, too. — alexis_laz, Jul 04 '14 at 18:00
Thank you for your comment! Just to make sure that I understand you correctly: I would have three columns of indices (muscle, MU, electrode) and then one (or multiple) column(s) containing the actual data. So I might have a row (1,2,3,4) indicating that the potential invoked by the second MU in the first muscle at the third electrode would be 4. Is that what you meant? If yes, I will just go ahead, try to implement it and see if it helps. :-) — Eike P., Jul 04 '14 at 22:30
Exactly that, yes! One column per categorical variable (i.e. `factor`s in R) and one column per dependent variable. The examples in the help page of `?reshape` might be helpful and, also, see functions like `aggregate`, `split` (followed by `lapply`), `by`, `ave`, `tapply` and their "See also"s. — alexis_laz, Jul 04 '14 at 22:45
Thanks for the hint, I'll definitely have a deeper look into that once I find the time (might take one or two days, though). I have the impression that this will greatly clarify my code. Interesting enough, considering that this changes the subject of my question from "how do I implement this nicely for my complicated data structure" to "how should I choose my data structure in order to be able to implement this nicely". — Eike P., Jul 04 '14 at 23:41
Just posted an answer based on your suggestion, @alexis_laz, and indeed it greatly improved the code. Lesson learned: If the implementation gets complicated, think about your data structure. :-) — Eike P., Jul 07 '14 at 14:21
Side note: Should I remove all context specific information from the question to render it more abstract? — Eike P., Jul 07 '14 at 14:23

score 1 · Accepted Answer · edited May 23 '17 at 11:50

As suggested by user @alexis_laz, it is a good option to actually not use a nested list structure for representing data, but rather use a (flat, 2D, non-nested) data.frame. This greatly simplified coding and increased code conciseness and readability for the above example and seems to be promising for other cases, too.

It completely eliminates the need for nested apply-foo and complicated list traversing. Moreover, it allows for the usage of a lot of built-in R functionality that works on data frames.

I think there are two main reasons why I did not consider this solution before:

It does not feel like the natural representation of my data, since the data actually are nested. By fitting the data into a data.frame, we're losing direct information on this nesting. It can of course be reconstructed, though.
It introduces redundancy, as I will have a lot of equivalent entries in the categorical columns. Of course this doesn't matter by any measure since it's just indices that don't consume a lot of memory or something, but it still feels kind of bad.

Here is the rewritten example. Comments are most welcome.

sum.MU.firing.contributions <- function(.firing.contribs, .firing.instants) {

    firing.info <- merge(.firing.contribs, .firing.instants)

    firing.info$contribs.time <- mapply(calc.MU.electrode.contrib,
                                        firing.info$contrib,
                                        firing.info$instants,
                                        SIMPLIFY = FALSE)

    surface.potentials <- by(I(firing.info$contribs.time),
                             factor(firing.info$electrode),
                             function(list) colSums(do.call(rbind, list)))

    surface.potentials
}


calc.MU.electrode.contrib <- function(.MU.firing.contrib, .MU.firing.instants) {

    ## This will in reality be more complicated since then .MU.firing.contrib
    ## will have a different (more complicated) structure.
    .MU.firing.contrib * .MU.firing.instants
}


firing.instants <- data.frame(muscle = c(1,1,2,2,2),
                              MU = c(1,2,1,2,3),
                              instants = I(list(c(F,F,T,F),c(F,T,F,F),
                                  c(F,F,F,F),c(T,F,T,F),c(F,T,T,F))))

firing.contribs <- data.frame(muscle = c(1,1,1,1,2,2,2,2,2,2),
                              MU = c(1,1,2,2,1,1,2,2,3,3),
                              electrode = c(1,2,1,2,1,2,1,2,1,2),
                              contrib = c(1,2,3,4,5,6,7,8,9,10))

surface.potentials <- sum.MU.firing.contributions(firing.contribs,
                                                  firing.instants)

+1! That is indeed a much simpler code to follow through! And a non-important note: if your `calc.MU.electrode.contrib` can't be a vectorized (`*` is vectorized) function, you could wrap you new function inside `Vectorize`; it won't be any faster at all (it's just a wrapper to `mapply`) but it may make the code a little bit cleaner. (`myvecfun = Vectorize(calc.MU.electrode.contrib, SIMPLIFY = F); myvecfun(firing.info$contrib, firing.info$instants)`). — alexis_laz, Jul 07 '14 at 15:40
Actually it can be vectorized in my case, but `Vectorize` is still really good to know about, thanks again! — Eike P., Jul 07 '14 at 16:01

How to structure R code for inherently nested problems to be easily readible?

1 Answers1