Running rapply on lists of dataframes

Question

To follow up on two rapply questions, here and here from years ago, it seems rapply only works on simple classes (i.e., vector, matrix) and not the multifaceted data.frame class.

In most cases, as demonstrated below, the equivalent of rapply is a nested lapply (or its wrappers, vapply/sapply), where the number of nested calls matches the number of list levels. Below is my test scenario comparing nested lapply with rapply on vector, matrix, and dataframe types. All but dataframes return equal results.

Question

Is there a use case in base R for rapply() to recursively run operations on a list of dataframes and return a list of dataframes, as it does for lists of vectors or matrices? If not, is this a bug, or should the ?rapply base R docs warn about it? Most tutorials do not show rapply examples with dataframes.

One Dimension (character vector)

Below shows how rapply is equivalent to nested lapply on simple character vectors running count of characters, and even shows how rapply is appreciably faster in processing:

library(microbenchmark)

ScriptLists <- list(R = list.files(path="/path/to/Scripts", pattern="\\.R"),
                    Python = list.files(path="/path/to/Scripts", pattern="\\.py"),
                    SQL = list.files(path="/path/to/Scripts", pattern="\\.sql"),
                    PHP = list.files(path="/path/to/Scripts", pattern="\\.php"),
                    XSLT = list.files(path="/path/to/Scripts", pattern="\\.xsl"))

microbenchmark(
  ScriptsLists1 <- lapply(ScriptLists, function(i){
    unname(vapply(i, function(x){ 
      nchar(x)
      }, numeric(1)))
    })
)
# Unit: microseconds
#  min      lq     mean   median      uq     max neval
#  384 408.782 524.1363 434.7675 678.016 886.377   100

microbenchmark(
  ScriptsLists2 <- rapply(ScriptLists, function(x){
    nchar(x)
  }, how="list")
)
# Unit: microseconds
#     min       lq     mean   median     uq     max neval
# 110.196 112.8425 131.6141 114.5265 123.91 352.722   100

all.equal(ScriptsLists1, ScriptsLists2)
# [1] TRUE
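
For readers without a scripts directory to point list.files() at, the same comparison can be reproduced on in-memory data. This is only a minimal sketch: the file names in `scripts` are made up for illustration, and the variable names `nested` and `recursive` are not from the original post.

```r
# Hypothetical file names standing in for the list.files() results above
scripts <- list(R      = c("clean.R", "model.R"),
                Python = c("etl.py"),
                SQL    = c("schema.sql"))

# Nested apply: outer lapply over languages, inner vapply over file names
nested <- lapply(scripts, function(files) {
  unname(vapply(files, nchar, integer(1)))
})

# rapply visits every non-list leaf (here, each character vector) directly
recursive <- rapply(scripts, nchar, how = "list")

print(all.equal(nested, recursive))
```

Because each leaf is an atomic character vector, rapply can hand it to nchar() in one call, which is why no inner loop is needed.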

Two Dimension Type (matrix vs. data.frame)

The input dataframe below (pulled from the yearly reputation rankings of Stack Overflow's top users) is used to build a list of top users' dataframes by language tag (C#, Python, R, etc.).

df <- structure(list(user = structure(c(12L, 14L, 19L, 35L, 22L, 32L, 
1L, 36L, 7L, 9L, 2L, 18L, 27L, 6L, 30L, 20L, 10L, 24L, 29L, 23L, 
5L, 3L, 4L, 15L, 25L, 17L, 11L, 8L, 33L, 13L, 34L, 16L, 21L, 
26L, 28L, 31L), .Label = c("akrun", "alecxe", "Alexey Mezenin", 
"BalusC", "Barmar", "CommonsWare", "Darin Dimitrov", "dasblinkenlight", 
"Eric Duminil", "Felix Kling", "Frank van Puffelen", "Gordon Linoff", 
"Greg Hewgill", "Günter Zöchbauer", "GurV", "Hans Passant", "JB Nizet", 
"Jean-François Fabre", "jezrael", "Jon Skeet", "Jonathan Leffler", 
"Martijn Pieters", "Martin R", "matt", "Nina Scholz", "paxdiablo", 
"piRSquared", "Pranav C Balan", "Psidom", "Quentin", "Suragch", 
"T.J. Crowder", "Tim Biegeleisen", "unutbu", "VonC", "Wiktor Stribi?ew"
), class = "factor"), link = structure(c(2L, 17L, 21L, 31L, 1L, 
10L, 27L, 28L, 22L, 33L, 35L, 34L, 20L, 3L, 15L, 19L, 18L, 25L, 
29L, 4L, 8L, 5L, 11L, 32L, 6L, 30L, 16L, 24L, 13L, 36L, 14L, 
12L, 9L, 7L, 23L, 26L), .Label = c("http://www.stackoverflow.com//users/100297/martijn-pieters", 
"http://www.stackoverflow.com//users/1144035/gordon-linoff", 
"http://www.stackoverflow.com//users/115145/commonsware", "http://www.stackoverflow.com//users/1187415/martin-r", 
"http://www.stackoverflow.com//users/1227923/alexey-mezenin", 
"http://www.stackoverflow.com//users/1447675/nina-scholz", "http://www.stackoverflow.com//users/14860/paxdiablo", 
"http://www.stackoverflow.com//users/1491895/barmar", "http://www.stackoverflow.com//users/15168/jonathan-leffler", 
"http://www.stackoverflow.com//users/157247/t-j-crowder", "http://www.stackoverflow.com//users/157882/balusc", 
"http://www.stackoverflow.com//users/17034/hans-passant", "http://www.stackoverflow.com//users/1863229/tim-biegeleisen", 
"http://www.stackoverflow.com//users/190597/unutbu", "http://www.stackoverflow.com//users/19068/quentin", 
"http://www.stackoverflow.com//users/209103/frank-van-puffelen", 
"http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer", 
"http://www.stackoverflow.com//users/218196/felix-kling", "http://www.stackoverflow.com//users/22656/jon-skeet", 
"http://www.stackoverflow.com//users/2336654/pirsquared", "http://www.stackoverflow.com//users/2901002/jezrael", 
"http://www.stackoverflow.com//users/29407/darin-dimitrov", "http://www.stackoverflow.com//users/3037257/pranav-c-balan", 
"http://www.stackoverflow.com//users/335858/dasblinkenlight", 
"http://www.stackoverflow.com//users/341994/matt", "http://www.stackoverflow.com//users/3681880/suragch", 
"http://www.stackoverflow.com//users/3732271/akrun", "http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew", 
"http://www.stackoverflow.com//users/4983450/psidom", "http://www.stackoverflow.com//users/571407/jb-nizet", 
"http://www.stackoverflow.com//users/6309/vonc", "http://www.stackoverflow.com//users/6348498/gurv", 
"http://www.stackoverflow.com//users/6419007/eric-duminil", "http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre", 
"http://www.stackoverflow.com//users/771848/alecxe", "http://www.stackoverflow.com//users/893/greg-hewgill"
), class = "factor"), location = structure(c(17L, 15L, 8L, 12L, 
10L, 26L, 1L, 28L, 23L, 1L, 17L, 25L, 6L, 29L, 26L, 19L, 24L, 
1L, 5L, 13L, 4L, 2L, 3L, 1L, 7L, 20L, 21L, 27L, 22L, 11L, 1L, 
16L, 9L, 1L, 18L, 14L), .Label = c("", "??????", "Amsterdam, Netherlands", 
"Arlington, MA", "Atlanta, GA, United States", "Bellevue, WA, United States", 
"Berlin, Deutschland", "Bratislava, Slovakia", "California, USA", 
"Cambridge, United Kingdom", "Christchurch, New Zealand", "France", 
"Germany", "Hohhot, China", "Linz, Austria", "Madison, WI", "New York, United States", 
"Ramanthali, Kannur, Kerala, India", "Reading, United Kingdom", 
"Saint-Etienne, France", "San Francisco, CA", "Singapore", "Sofia, Bulgaria", 
"Sunnyvale, CA", "Toulouse, France", "United Kingdom", "United States", 
"Warsaw, Poland", "Who Wants to Know?"), class = "factor"), year_rep = structure(c(36L, 
35L, 34L, 33L, 32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 23L, 
22L, 21L, 20L, 19L, 18L, 17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L, 
9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("3,580", "3,604", 
"3,636", "3,649", "3,688", "3,735", "3,796", "3,814", "3,886", 
"3,920", "3,923", "3,950", "4,016", "4,046", "4,142", "4,179", 
"4,195", "4,236", "4,313", "4,324", "4,348", "4,464", "4,475", 
"4,482", "4,526", "4,723", "4,854", "4,936", "4,948", "5,188", 
"5,258", "5,337", "5,577", "5,740", "5,835", "5,985"), class = "factor"), 
    total_rep = structure(c(18L, 2L, 34L, 27L, 22L, 20L, 5L, 
    3L, 31L, 1L, 6L, 9L, 13L, 25L, 21L, 36L, 14L, 4L, 11L, 7L, 
    8L, 10L, 30L, 29L, 24L, 15L, 35L, 17L, 33L, 23L, 12L, 28L, 
    16L, 19L, 26L, 32L), .Label = c("12,557", "154,439", "158,134", 
    "220,515", "229,553", "233,368", "269,380", "289,989", "30,027", 
    "31,602", "36,950", "401,595", "41,183", "411,535", "418,780", 
    "455,157", "475,813", "499,408", "507,043", "508,310", "509,365", 
    "525,176", "529,137", "61,135", "616,135", "64,476", "651,397", 
    "672,118", "7,932", "703,046", "709,683", "71,032", "77,211", 
    "83,237", "86,520", "921,690"), class = "factor"), tag1 = structure(c(15L, 
    2L, 10L, 6L, 11L, 8L, 12L, 13L, 4L, 14L, 11L, 11L, 10L, 1L, 
    8L, 4L, 8L, 16L, 11L, 16L, 8L, 9L, 7L, 15L, 8L, 7L, 5L, 4L, 
    15L, 6L, 11L, 4L, 3L, 3L, 8L, 16L), .Label = c("android", 
    "angular2", "c", "c#", "firebase", "git", "java", "javascript", 
    "laravel", "pandas", "python", "r", "regex", "ruby", "sql", 
    "swift"), class = "factor"), tag2 = structure(c(23L, 24L, 
    19L, 8L, 20L, 14L, 6L, 13L, 3L, 21L, 22L, 20L, 19L, 12L, 
    10L, 12L, 14L, 11L, 17L, 11L, 18L, 18L, 15L, 16L, 2L, 9L, 
    7L, 12L, 16L, 19L, 17L, 1L, 4L, 5L, 14L, 11L), .Label = c(".net", 
    "arrays", "asp.net-mvc", "bash", "c++", "dplyr", "firebase-database", 
    "github", "hibernate", "html", "ios", "java", "javascript", 
    "jquery", "jsf", "mysql", "pandas", "php", "python", "python-3.x", 
    "ruby-on-rails", "selenium", "sql-server", "typescript"), class = "factor"), 
    tag3 = structure(c(20L, 17L, 11L, 12L, 24L, 15L, 11L, 8L, 
    5L, 4L, 23L, 24L, 11L, 3L, 10L, 1L, 6L, 31L, 25L, 28L, 18L, 
    19L, 26L, 27L, 22L, 16L, 2L, 9L, 15L, 13L, 21L, 30L, 29L, 
    7L, 14L, 2L), .Label = c(".net", "android", "android-intent", 
    "arrays", "asp.net-mvc-3", "asynchronous", "bash", "c#", 
    "c++", "css", "dataframe", "docker", "git-pull", "html", 
    "java", "java-8", "javascript", "jquery", "laravel-5.3", 
    "mysql", "numpy", "object", "protractor", "python-2.7", "r", 
    "servlets", "sql-server", "swift3", "unix", "winforms", "xcode"
    ), class = "factor")), .Names = c("user", "link", "location", 
"year_rep", "total_rep", "tag1", "tag2", "tag3"), class = "data.frame", row.names = c(NA, 
-36L))

R Code

The methods below average the year_rep and total_rep columns for either type, matrix or dataframe. Be sure to change the return statement in the setup block, swapping in the commented line for the type being tested. Notice that rapply() returns the same result as nested lapply for the matrix type, but not for the dataframe type.

# NESTED LIST SETUP ------------------------------------
LangLists <- list(`c#`=list(), python=list(), sql=list(), php=list(), r=list(),
                  java=list(), javascript=list(), ruby=list(), `c++`=list())

LangLists <- setNames(mapply(function(i, j){

  df <- subset(df, tag1 == j | tag2 == j | tag3 == j)
  df$year_rep <- as.numeric(as.character(gsub(",", "", df$year_rep)))
  df$total_rep <- as.numeric(as.character(gsub(",", "", df$total_rep)))

  return(list(as.matrix(df)))   # MATRIX TYPE
  # return(list(df))            # DF TYPE

}, LangLists, names(LangLists), SIMPLIFY=FALSE), names(LangLists))
# -----------------------------------------------------

# MATRIX RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){
    # Select columns by name so the result does not depend on column order
    cbind(mean(as.numeric(df[, "year_rep"])),
          mean(as.numeric(df[, "total_rep"])))
  })
})

LangLists2 <- rapply(LangLists, function(i){
  cbind(mean(as.numeric(i[, "year_rep"])),
        mean(as.numeric(i[, "total_rep"])))
}, classes="matrix", how="list")

all.equal(LangLists1, LangLists2)
# [1] TRUE


# DATA FRAME RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){         
    data.frame(year_rep=mean(df$year_rep),
               total_rep=mean(df$total_rep))        
  })
})

LangLists2 <- rapply(LangLists, function(i){      
    data.frame(year_rep=mean(i$year_rep),
               total_rep=mean(i$total_rep))      
}, classes="data.frame", how="list")

all.equal(LangLists1, LangLists2)

# [1] "Component “c#”: Component 1: Names: 2 string mismatches"                                               
# [2] "Component “c#”: Component 1: Attributes: < names for target but not for current >"                     
# [3] "Component “c#”: Component 1: Attributes: < Length mismatch: comparison on first 0 components >"        
# [4] "Component “c#”: Component 1: Length mismatch: comparison on first 2 components"                        
# [5] "Component “c#”: Component 1: Component 1: Modes: numeric, NULL"  
...

In fact, whereas the nested lapply remains a list of intact dataframes of the two columns for rep means, the rapply for dataframes converts underlying dataframes to lists of NULLs. So again, why does rapply fail to return original list of dataframes compared to vectors/matrices?

# $`c#`
# $`c#`[[1]]
# $`c#`[[1]]$X
# NULL

# $`c#`[[1]]$user
# NULL

# $`c#`[[1]]$link
# NULL

# $`c#`[[1]]$location
# NULL

# $`c#`[[1]]$year_rep
# NULL

# $`c#`[[1]]$total_rep
# NULL

# $`c#`[[1]]$tag1
# NULL

# $`c#`[[1]]$tag2
# NULL

# $`c#`[[1]]$tag3
# NULL

# $python
# $python[[1]]
# $python[[1]]$X
# NULL

# $python[[1]]$user
# NULL

# $python[[1]]$link
# NULL

# $python[[1]]$location
# NULL

# $python[[1]]$year_rep
# NULL

# $python[[1]]$total_rep
# NULL

# $python[[1]]$tag1
# NULL

# $python[[1]]$tag2
# NULL

# $python[[1]]$tag3
# NULL

@rawr - so if `rapply` doesn't work, is this a bug, or should users be warned about using `rapply` on the `data.frame` class? — Parfait, Jan 23 '17 at 18:53
It appears that `rapply` will not work for a list of `data.frames`. From the Details section of `?rapply`: if `how = "list"` or `how = "unlist"`, "the list is copied, *all non-list* elements which have a class included in classes are replaced by the result of applying f to the element and *all others* are replaced by *deflt*," where the default setting of deflt is NULL. — lmo, Jan 23 '17 at 19:28

score 3 · Accepted Answer · answered Jan 24 '17 at 12:40

It appears that rapply is not designed to process lists of data.frames.

The Details section of ?rapply says that if

how = "list" or how = "unlist", the list is copied, all non-list elements which have a class included in classes are replaced by the result of applying f to the element and all others are replaced by deflt.

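Since data.frames are lists, they do not fall under the first category. Instead, rapply descends into each data.frame, and its columns, none of which has class "data.frame", fall into the "all others" catch-all and are replaced by deflt, whose default value is NULL. This explains the result of the final line of code in the question, where every column comes back as NULL.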
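
The distinction can be checked directly. The sketch below uses a small made-up data frame (`toy`, with invented values) rather than the question's data. It suggests that a matrix is a leaf from rapply's point of view, while a data frame is a list whose columns are the leaves.

```r
# Toy inputs (invented values, for illustration only)
toy <- data.frame(year_rep = c(1, 2, 3), total_rep = c(10, 20, 30))
mat <- as.matrix(toy)

is.list(mat)   # a matrix is an atomic vector with a dim attribute
is.list(toy)   # a data frame is a list of columns

# classes = "data.frame" never matches here, because rapply descends into
# the data frame and only tests its columns; each column falls back to deflt
by_df <- rapply(list(toy), function(d) colMeans(d),
                classes = "data.frame", how = "list")
str(by_df)

# Matching on the column class instead reaches the data
by_col <- rapply(list(toy), mean, classes = "numeric", how = "list")
str(by_col)
```

With how = "list" the result is a plain nested list in both calls, so even the second call returns column means inside a list rather than a data frame.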

The final alternative argument to how is "replace", and it appears that data.frames are simply ignored under this "mode":

If how = "replace", each element of the list which is not itself a list and has a class included in classes is replaced by the result of applying f to the element.
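
Applied to columns, the quoted rule can be sketched as follows. The data frame `toy` is made up for illustration. Only the column values are inspected here, since whether the result keeps its data.frame class may depend on the R version in use.

```r
# Toy data frame (invented values); stringsAsFactors = FALSE keeps `tag`
# a character column regardless of R version
toy <- data.frame(year_rep = c(1, 2, 3),
                  tag      = c("r", "sql", "r"),
                  stringsAsFactors = FALSE)

# how = "replace": numeric columns are passed to f, other columns are kept
doubled <- rapply(list(toy), function(col) col * 2,
                  classes = "numeric", how = "replace")

doubled[[1]][["year_rep"]]   # numeric column after applying f
doubled[[1]][["tag"]]        # character column left as it was
```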

There is no mention of elements that are themselves lists. Running the code above with how="replace" appears to return a nested list in which what were data.frames are now simple lists, so it appears that rapply went through and stripped the class attribute.
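
One way to get a list of data frames back without hand-writing one lapply per nesting level is a small recursive helper that stops at data frames instead of descending into them. This is a sketch, not a base R function: `rapply_df` and the sample data in `langs` are hypothetical names introduced here for illustration.

```r
# Hypothetical helper (not part of base R): walk a nested list and apply f
# to every data frame, treating each data frame as a leaf
rapply_df <- function(x, f, ...) {
  if (is.data.frame(x)) {
    f(x, ...)                                      # leaf: transform it
  } else if (is.list(x)) {
    lapply(x, function(el) rapply_df(el, f, ...))  # branch: recurse
  } else {
    x                                              # anything else: unchanged
  }
}

# Invented sample data with data frames at the second nesting level
langs <- list(
  r      = list(data.frame(year_rep = c(1, 3), total_rep = c(10, 30))),
  python = list(data.frame(year_rep = c(2, 4), total_rep = c(20, 40)))
)

means <- rapply_df(langs, function(d) {
  data.frame(year_rep = mean(d$year_rep), total_rep = mean(d$total_rep))
})

str(means)
```

The is.data.frame() test has to come before the is.list() test, because a data frame satisfies both; checking in the other order would recurse into the columns, which is the behaviour being avoided.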
