Following up on two rapply questions, here and here from years ago, it seems rapply only works on simple classes (i.e., vector, matrix) and not on the multifaceted data.frame class.
In most cases, as demonstrated below, the rapply equivalent is a nested lapply (or its wrapper variants, vapply/sapply), where the number of nested calls corresponds to the number of list levels. Below is my testing scenario comparing nested lapply and rapply across vector, matrix, and dataframe types. All but dataframes return equal results.
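To make the nesting-depth point concrete, here is a minimal sketch on a made-up list nested two levels deep (nested, via_lapply, and via_rapply are names chosen only for illustration). Two lapply calls are needed where a single rapply call suffices.

```r
# Hypothetical toy data: character vectors sitting two list levels down
nested <- list(
  a = list(x = c("one", "three"), y = "five"),
  b = list(z = c("seven", "eleven"))
)

# Two levels of nesting need two lapply calls ...
via_lapply <- lapply(nested, function(inner) lapply(inner, nchar))

# ... while rapply walks down to the leaves on its own
via_rapply <- rapply(nested, nchar, how = "list")

# Compare the two results component by component
all.equal(via_lapply, via_rapply)
```

A third level of lists would need a third lapply, while the rapply call would stay unchanged.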
Question
Is there a use case in base R for rapply() to recursively run operations on a list of dataframes and return a list of dataframes, as it does for lists of vectors or matrices? If not, is this a bug, or should it be noted in the ?rapply base R docs? Most tutorials do not show rapply examples with dataframes.
One Dimension (character vector)
The code below shows that rapply is equivalent to nested lapply on simple character vectors when counting characters, and that rapply is also appreciably faster:
library(microbenchmark)
ScriptLists <- list(R = list.files(path="/path/to/Scripts", pattern="\\.R"),
                    Python = list.files(path="/path/to/Scripts", pattern="\\.py"),
                    SQL = list.files(path="/path/to/Scripts", pattern="\\.sql"),
                    PHP = list.files(path="/path/to/Scripts", pattern="\\.php"),
                    XSLT = list.files(path="/path/to/Scripts", pattern="\\.xsl"))
microbenchmark(
  ScriptsLists1 <- lapply(ScriptLists, function(i){
    unname(vapply(i, function(x){
      nchar(x)
    }, numeric(1)))
  })
)
# Unit: microseconds
# min lq mean median uq max neval
# 384 408.782 524.1363 434.7675 678.016 886.377 100
microbenchmark(
  ScriptsLists2 <- rapply(ScriptLists, function(x){
    nchar(x)
  }, how="list")
)
# Unit: microseconds
# min lq mean median uq max neval
# 110.196 112.8425 131.6141 114.5265 123.91 352.722 100
all.equal(ScriptsLists1, ScriptsLists2)
# [1] TRUE
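Since the list.files() calls depend on a local folder, here is a self-contained sketch of the same comparison using made-up file names (ToyScripts, Toy1, and Toy2 are hypothetical stand-ins). It also shows two details of the code above: why unname() is needed after vapply(), and that the two results agree under all.equal() even though their storage types differ.

```r
# Hypothetical file names standing in for the list.files() results
ToyScripts <- list(R      = c("clean.R", "model.R"),
                   Python = c("etl.py"),
                   SQL    = c("report.sql", "load.sql"))

# vapply() names its result after the character input, hence unname();
# numeric(1) also promotes nchar()'s integer counts to double
Toy1 <- lapply(ToyScripts, function(i) unname(vapply(i, nchar, numeric(1))))

# rapply() hands each whole vector to nchar(), which returns integers
Toy2 <- rapply(ToyScripts, nchar, how = "list")

all.equal(Toy1, Toy2)   # compares values, ignoring integer vs. double
typeof(Toy1$R)          # storage type from the vapply() route
typeof(Toy2$R)          # storage type from the rapply() route
```

Note that the rapply version calls nchar() once per vector, while the nested version calls it once per element, which is probably part of why it benchmarks faster.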
Two Dimension Type (matrix vs. data.frame)
The input dataframe below (pulled from the highest yearly reputation rankings of StackOverflow top users) is used to build a list of top users' dataframes by language tag (C#, Python, R, etc.).
df <- structure(list(user = structure(c(12L, 14L, 19L, 35L, 22L, 32L,
1L, 36L, 7L, 9L, 2L, 18L, 27L, 6L, 30L, 20L, 10L, 24L, 29L, 23L,
5L, 3L, 4L, 15L, 25L, 17L, 11L, 8L, 33L, 13L, 34L, 16L, 21L,
26L, 28L, 31L), .Label = c("akrun", "alecxe", "Alexey Mezenin",
"BalusC", "Barmar", "CommonsWare", "Darin Dimitrov", "dasblinkenlight",
"Eric Duminil", "Felix Kling", "Frank van Puffelen", "Gordon Linoff",
"Greg Hewgill", "Günter Zöchbauer", "GurV", "Hans Passant", "JB Nizet",
"Jean-François Fabre", "jezrael", "Jon Skeet", "Jonathan Leffler",
"Martijn Pieters", "Martin R", "matt", "Nina Scholz", "paxdiablo",
"piRSquared", "Pranav C Balan", "Psidom", "Quentin", "Suragch",
"T.J. Crowder", "Tim Biegeleisen", "unutbu", "VonC", "Wiktor Stribi?ew"
), class = "factor"), link = structure(c(2L, 17L, 21L, 31L, 1L,
10L, 27L, 28L, 22L, 33L, 35L, 34L, 20L, 3L, 15L, 19L, 18L, 25L,
29L, 4L, 8L, 5L, 11L, 32L, 6L, 30L, 16L, 24L, 13L, 36L, 14L,
12L, 9L, 7L, 23L, 26L), .Label = c("http://www.stackoverflow.com//users/100297/martijn-pieters",
"http://www.stackoverflow.com//users/1144035/gordon-linoff",
"http://www.stackoverflow.com//users/115145/commonsware", "http://www.stackoverflow.com//users/1187415/martin-r",
"http://www.stackoverflow.com//users/1227923/alexey-mezenin",
"http://www.stackoverflow.com//users/1447675/nina-scholz", "http://www.stackoverflow.com//users/14860/paxdiablo",
"http://www.stackoverflow.com//users/1491895/barmar", "http://www.stackoverflow.com//users/15168/jonathan-leffler",
"http://www.stackoverflow.com//users/157247/t-j-crowder", "http://www.stackoverflow.com//users/157882/balusc",
"http://www.stackoverflow.com//users/17034/hans-passant", "http://www.stackoverflow.com//users/1863229/tim-biegeleisen",
"http://www.stackoverflow.com//users/190597/unutbu", "http://www.stackoverflow.com//users/19068/quentin",
"http://www.stackoverflow.com//users/209103/frank-van-puffelen",
"http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer",
"http://www.stackoverflow.com//users/218196/felix-kling", "http://www.stackoverflow.com//users/22656/jon-skeet",
"http://www.stackoverflow.com//users/2336654/pirsquared", "http://www.stackoverflow.com//users/2901002/jezrael",
"http://www.stackoverflow.com//users/29407/darin-dimitrov", "http://www.stackoverflow.com//users/3037257/pranav-c-balan",
"http://www.stackoverflow.com//users/335858/dasblinkenlight",
"http://www.stackoverflow.com//users/341994/matt", "http://www.stackoverflow.com//users/3681880/suragch",
"http://www.stackoverflow.com//users/3732271/akrun", "http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew",
"http://www.stackoverflow.com//users/4983450/psidom", "http://www.stackoverflow.com//users/571407/jb-nizet",
"http://www.stackoverflow.com//users/6309/vonc", "http://www.stackoverflow.com//users/6348498/gurv",
"http://www.stackoverflow.com//users/6419007/eric-duminil", "http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre",
"http://www.stackoverflow.com//users/771848/alecxe", "http://www.stackoverflow.com//users/893/greg-hewgill"
), class = "factor"), location = structure(c(17L, 15L, 8L, 12L,
10L, 26L, 1L, 28L, 23L, 1L, 17L, 25L, 6L, 29L, 26L, 19L, 24L,
1L, 5L, 13L, 4L, 2L, 3L, 1L, 7L, 20L, 21L, 27L, 22L, 11L, 1L,
16L, 9L, 1L, 18L, 14L), .Label = c("", "??????", "Amsterdam, Netherlands",
"Arlington, MA", "Atlanta, GA, United States", "Bellevue, WA, United States",
"Berlin, Deutschland", "Bratislava, Slovakia", "California, USA",
"Cambridge, United Kingdom", "Christchurch, New Zealand", "France",
"Germany", "Hohhot, China", "Linz, Austria", "Madison, WI", "New York, United States",
"Ramanthali, Kannur, Kerala, India", "Reading, United Kingdom",
"Saint-Etienne, France", "San Francisco, CA", "Singapore", "Sofia, Bulgaria",
"Sunnyvale, CA", "Toulouse, France", "United Kingdom", "United States",
"Warsaw, Poland", "Who Wants to Know?"), class = "factor"), year_rep = structure(c(36L,
35L, 34L, 33L, 32L, 31L, 30L, 29L, 28L, 27L, 26L, 25L, 24L, 23L,
22L, 21L, 20L, 19L, 18L, 17L, 16L, 15L, 14L, 13L, 12L, 11L, 10L,
9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L), .Label = c("3,580", "3,604",
"3,636", "3,649", "3,688", "3,735", "3,796", "3,814", "3,886",
"3,920", "3,923", "3,950", "4,016", "4,046", "4,142", "4,179",
"4,195", "4,236", "4,313", "4,324", "4,348", "4,464", "4,475",
"4,482", "4,526", "4,723", "4,854", "4,936", "4,948", "5,188",
"5,258", "5,337", "5,577", "5,740", "5,835", "5,985"), class = "factor"),
total_rep = structure(c(18L, 2L, 34L, 27L, 22L, 20L, 5L,
3L, 31L, 1L, 6L, 9L, 13L, 25L, 21L, 36L, 14L, 4L, 11L, 7L,
8L, 10L, 30L, 29L, 24L, 15L, 35L, 17L, 33L, 23L, 12L, 28L,
16L, 19L, 26L, 32L), .Label = c("12,557", "154,439", "158,134",
"220,515", "229,553", "233,368", "269,380", "289,989", "30,027",
"31,602", "36,950", "401,595", "41,183", "411,535", "418,780",
"455,157", "475,813", "499,408", "507,043", "508,310", "509,365",
"525,176", "529,137", "61,135", "616,135", "64,476", "651,397",
"672,118", "7,932", "703,046", "709,683", "71,032", "77,211",
"83,237", "86,520", "921,690"), class = "factor"), tag1 = structure(c(15L,
2L, 10L, 6L, 11L, 8L, 12L, 13L, 4L, 14L, 11L, 11L, 10L, 1L,
8L, 4L, 8L, 16L, 11L, 16L, 8L, 9L, 7L, 15L, 8L, 7L, 5L, 4L,
15L, 6L, 11L, 4L, 3L, 3L, 8L, 16L), .Label = c("android",
"angular2", "c", "c#", "firebase", "git", "java", "javascript",
"laravel", "pandas", "python", "r", "regex", "ruby", "sql",
"swift"), class = "factor"), tag2 = structure(c(23L, 24L,
19L, 8L, 20L, 14L, 6L, 13L, 3L, 21L, 22L, 20L, 19L, 12L,
10L, 12L, 14L, 11L, 17L, 11L, 18L, 18L, 15L, 16L, 2L, 9L,
7L, 12L, 16L, 19L, 17L, 1L, 4L, 5L, 14L, 11L), .Label = c(".net",
"arrays", "asp.net-mvc", "bash", "c++", "dplyr", "firebase-database",
"github", "hibernate", "html", "ios", "java", "javascript",
"jquery", "jsf", "mysql", "pandas", "php", "python", "python-3.x",
"ruby-on-rails", "selenium", "sql-server", "typescript"), class = "factor"),
tag3 = structure(c(20L, 17L, 11L, 12L, 24L, 15L, 11L, 8L,
5L, 4L, 23L, 24L, 11L, 3L, 10L, 1L, 6L, 31L, 25L, 28L, 18L,
19L, 26L, 27L, 22L, 16L, 2L, 9L, 15L, 13L, 21L, 30L, 29L,
7L, 14L, 2L), .Label = c(".net", "android", "android-intent",
"arrays", "asp.net-mvc-3", "asynchronous", "bash", "c#",
"c++", "css", "dataframe", "docker", "git-pull", "html",
"java", "java-8", "javascript", "jquery", "laravel-5.3",
"mysql", "numpy", "object", "protractor", "python-2.7", "r",
"servlets", "sql-server", "swift3", "unix", "winforms", "xcode"
), class = "factor")), .Names = c("user", "link", "location",
"year_rep", "total_rep", "tag1", "tag2", "tag3"), class = "data.frame", row.names = c(NA,
-36L))
R Code
The methods below average the year_rep and total_rep columns (5th/6th in the data behind the output further down, which carries a leading X column) for either type, matrix or dataframe. Be sure to change the return statement in the setup block, swapping in the commented line for the other type. Notice that only the rapply() for matrices returns the same as nested lapply; the rapply() for dataframes does not.
# NESTED LIST SETUP ------------------------------------
LangLists <- list(`c#`=list(), python=list(), sql=list(), php=list(), r=list(),
                  java=list(), javascript=list(), ruby=list(), `c++`=list())
LangLists <- setNames(mapply(function(i, j){
  df <- subset(df, tag1 == j | tag2 == j | tag3 == j)
  df$year_rep <- as.numeric(as.character(gsub(",", "", df$year_rep)))
  df$total_rep <- as.numeric(as.character(gsub(",", "", df$total_rep)))
  return(list(as.matrix(df)))   # MATRIX TYPE
  # return(list(df))            # DF TYPE
}, LangLists, names(LangLists), SIMPLIFY=FALSE), names(LangLists))
# -----------------------------------------------------
# MATRIX RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){
    # select by column name so the result does not depend on column order
    cbind(mean(as.numeric(df[, "year_rep"])),
          mean(as.numeric(df[, "total_rep"])))
  })
})
LangLists2 <- rapply(LangLists, function(i){
  cbind(mean(as.numeric(i[, "year_rep"])),
        mean(as.numeric(i[, "total_rep"])))
}, classes="matrix", how="list")
all.equal(LangLists1, LangLists2)
# [1] TRUE
# DATA FRAME RETURN
LangLists1 <- lapply(LangLists, function(i){
  lapply(i, function(df){
    data.frame(year_rep=mean(df$year_rep),
               total_rep=mean(df$total_rep))
  })
})
LangLists2 <- rapply(LangLists, function(i){
  data.frame(year_rep=mean(i$year_rep),
             total_rep=mean(i$total_rep))
}, classes="data.frame", how="list")
all.equal(LangLists1, LangLists2)
# [1] "Component “c#”: Component 1: Names: 2 string mismatches"
# [2] "Component “c#”: Component 1: Attributes: < names for target but not for current >"
# [3] "Component “c#”: Component 1: Attributes: < Length mismatch: comparison on first 0 components >"
# [4] "Component “c#”: Component 1: Length mismatch: comparison on first 2 components"
# [5] "Component “c#”: Component 1: Component 1: Modes: numeric, NULL"
...
In fact, whereas the nested lapply returns a list of intact dataframes holding the two columns of rep means, the rapply for dataframes converts the underlying dataframes to lists of NULLs. So again, why does rapply fail to return a list of dataframes as it does for vectors/matrices?
# $`c#`
# $`c#`[[1]]
# $`c#`[[1]]$X
# NULL
# $`c#`[[1]]$user
# NULL
# $`c#`[[1]]$link
# NULL
# $`c#`[[1]]$location
# NULL
# $`c#`[[1]]$year_rep
# NULL
# $`c#`[[1]]$total_rep
# NULL
# $`c#`[[1]]$tag1
# NULL
# $`c#`[[1]]$tag2
# NULL
# $`c#`[[1]]$tag3
# NULL
# $python
# $python[[1]]
# $python[[1]]$X
# NULL
# $python[[1]]$user
# NULL
# $python[[1]]$link
# NULL
# $python[[1]]$location
# NULL
# $python[[1]]$year_rep
# NULL
# $python[[1]]$total_rep
# NULL
# $python[[1]]$tag1
# NULL
# $python[[1]]$tag2
# NULL
# $python[[1]]$tag3
# NULL
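For a smaller reproduction of the same symptom, the sketch below uses a made-up one-element list (toy, seen, res_df, and res_num are illustrative names, not part of the data above). As far as I can tell, it suggests that rapply treats a dataframe as just another list level and applies the function to the individual columns, so classes="data.frame" never matches a leaf and every column falls back to the default NULL.

```r
# Hypothetical toy data: one data frame nested one list level down
toy <- list(grp = list(data.frame(year_rep  = c(10, 20),
                                  total_rep = c(100, 300))))

# A data frame is itself a list (of columns); a matrix is not
is.list(toy$grp[[1]])              # TRUE
is.list(as.matrix(toy$grp[[1]]))   # FALSE

# Record the class of every object that rapply() actually passes to f
seen <- character(0)
invisible(rapply(toy, function(x) {
  seen <<- c(seen, class(x)[1])
  x
}, how = "list"))
seen   # the columns' classes, not "data.frame"

# No leaf has class "data.frame", so each column gets deflt = NULL
res_df <- rapply(toy, function(x) x, classes = "data.frame", how = "list")

# Matching on the column class does reach the values
res_num <- rapply(toy, mean, classes = "numeric", how = "list")
```

Here res_df has the same shape as the NULL-filled output above, while res_num holds the column means but as a plain named list rather than a dataframe.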