6

This is my first week working with R and there is one thing about functions I cannot seem to manage.

df <- data.frame(a = c(1:10),
             b = c("a", "a", "b", "c", "c", "b", "a", "c", "c", "b"))

testF = function(select) {
dum = subset(df, b == select)
}

lapply(unique(df$b), testF)

This function now just prints the data sets on screen. But I would like to store the results as separate data frames in my workspace. In this example this would give three data frames: a, b and c.

Thanks for the help.

Joris Meys
user3122822

3 Answers

2

Roland has the correct solution for the specific problem: nothing more than split() is needed. Just to be clear: split() returns a list. To get separate data frames in your workspace, you can do:

list2env(split(df,df$b),.GlobalEnv)

Or, using assign:

tmp <- split(df,df$b)
for(i in names(tmp)) assign(i,tmp[[i]])

A word on subset

That said, here is some more detail on why your function is plain wrong. First of all, in ?subset you read:

Warning

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

Translates to: Never ever in your life use subset() within a function again.
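To make the warning concrete, here is a minimal sketch of one way the non-standard evaluation can bite. The function names pick_subset and pick_index are hypothetical, and the example assumes the function argument happens to share its name with a column of df:

```r
df <- data.frame(a = 1:10,
                 b = c("a", "a", "b", "c", "c", "b", "a", "c", "c", "b"))

# Hypothetical function whose argument is named like the column `b`.
# Inside subset(), `b == b` is evaluated within df, so the column is
# compared with itself and the argument is silently ignored.
pick_subset <- function(b) subset(df, b == b)

# With standard indexing, the right-hand `b` is the function argument.
pick_index <- function(b) df[df$b == b, ]

nrow(pick_subset("a"))  # 10: every row is kept
nrow(pick_index("a"))   # 3: only the rows where b is "a"
```

No error or warning is raised in the first case, which is what makes this kind of bug hard to spot.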


A word on returning values from a function

Next to that, a function always returns a result:

  • if a return() statement is used, it returns whatever is given as an argument to return().
  • otherwise it returns the value of the last expression evaluated.

In your case, the last line contains an assignment. Now an assignment also returns a value, but you don't see it. It's returned invisibly. You can see it by wrapping it in parentheses, for example:

> x <- 10
> (x <- 20)
[1] 20

In your function, that assignment is absolutely unnecessary. The invisible return value is the reason why your function works when used in lapply() (lapply collects whatever is returned, visible or not), but won't give you any (visible) output when used at the command line. You can capture it though:

> testF("b")
> x <- testF("b")
> x
    a b
3   3 b
6   6 b
10 10 b

The assignment in your function doesn't make sense: either you return dum explicitly, or you just drop the assignment altogether.
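The difference between a visible and an invisible return value can be checked with base R's withVisible(). A small sketch, where f_assign and f_plain are hypothetical names used only for illustration:

```r
# Last expression is an assignment: the value is returned invisibly.
f_assign <- function() x <- 20

# Last expression is a plain value: the value is returned visibly.
f_plain <- function() 20

withVisible(f_assign())$visible  # FALSE
withVisible(f_plain())$visible   # TRUE

# The value itself is the same in both cases.
withVisible(f_assign())$value    # 20
```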


Correcting your function

So, given this is just an example and the real problem wouldn't be solved by simply using split(), your function would be:

testF <- function(select) {
    dum <- df[df$b == select, ]   # == compares; a single = here is a syntax error
    return(dum)
}

or simply:

testF <- function(select){
    df[df$b == select, ]
}
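
As a rough sketch of how such a function could be combined with lapply() to get a named list, assuming b is stored as character (hence stringsAsFactors = FALSE); pick_rows, groups and res are hypothetical names introduced only for this example:

```r
df <- data.frame(a = 1:10,
                 b = c("a", "a", "b", "c", "c", "b", "a", "c", "c", "b"),
                 stringsAsFactors = FALSE)

# Standard indexing instead of subset(); the result is returned visibly.
pick_rows <- function(select) {
    df[df$b == select, ]
}

groups <- unique(df$b)

# One data frame per group, named after the group it contains.
res <- setNames(lapply(groups, pick_rows), groups)

names(res)   # "a" "b" "c"
res[["c"]]   # the rows where b is "c"
```

The named list can then be handed to list2env() or used directly.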
Joris Meys
  • 1
    Thanks for your comprehensive and detailed reaction. And I just will never ever use the subset() again, even outside functions. – user3122822 Dec 20 '13 at 16:31
1

Your function needs a return value. See help("function") for details.

However, for your specific case you can simply use split:

split(df, df$b)

$a
  a b
1 1 a
2 2 a
7 7 a

$b
    a b
3   3 b
6   6 b
10 10 b

$c
  a b
4 4 c
5 5 c
8 8 c
9 9 c
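
Because the result is a named list, the pieces can be used without creating separate objects in the workspace. A minimal sketch, where parts is a hypothetical name for the split result:

```r
df <- data.frame(a = 1:10,
                 b = c("a", "a", "b", "c", "c", "b", "a", "c", "c", "b"))

parts <- split(df, df$b)

# Access one group by name.
parts[["b"]]

# Apply the same operation to every group.
sapply(parts, nrow)                       # a: 3, b: 3, c: 4
sapply(parts, function(d) sum(d$a))       # a: 10, b: 19, c: 26
```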
Roland
  • 1
    Maybe it's interesting to point out that `split()` returns a list, that you can use `assign()` to create separate data frames and that this is in general a bad idea. Creating different data frames when you can just work with a list, that is. – Joris Meys Dec 20 '13 at 13:19
  • 1
    Or `list2env` if you have a list and for some strange reason "must" put its content into the global environment. – Roland Dec 20 '13 at 13:24
  • Many thanks for all the help. I will use split and work my way through the help pages. – user3122822 Dec 20 '13 at 13:38
1

Here is a solution using the list2env function mentioned in the comment above, assuming you wish to use subset regardless of the potential issues with it inside a function.

df <- data.frame(a = c(1:10),
             b = c("a", "a", "b", "c", "c", "b", "a", "c", "c", "b"))

testF = function(select) {
    dum = subset(df, b == select)
    dum                                # you need to return the data frame resulting from the subset back out of the function
}

my.list = lapply(unique(df$b), testF)
names(my.list) = unique(df$b)          # set the names of the list elements to the subsets they represent (a,b,c)
list2env(my.list,envir = .GlobalEnv)   # copy the data frames from the list to the Global Environment

If you had a simple example like the one you portray, you could access the elements of the list one by one as follows and assign each to a variable.

a = my.list[[1]]
b = my.list[[2]]
c = my.list[[3]]

Finally, you could define the function inline and make use of the (awesome) data.table package, thereby avoiding the use of subset:

library(data.table)
dt <- data.table(a = c(1:10),
             b = c("a", "a", "b", "c", "c", "b", "a", "c", "c", "b"))   
my.list = lapply(unique(dt$b), function(select) { dt[b == eval(select)]})

Hope this helps.

Matt Weller
  • 1
    As specifically pointed out in `?subset` you're NOT supposed to use subset within a function. subset is a convenience function. Use `df[df$b==select,]` to avoid huge trouble at some point. – Joris Meys Dec 20 '13 at 14:00
  • Noted @Joris Meys, I've included an example at the bottom which avoids this. Should my answer therefore remove the `subset` reference? I'm not sure of good practice on answering these questions. Also, I included a `data.table` example as opposed to `df[df$b==select,]` purely because I think the package is so good (speed, memory, functionality) it pays to start using it over `data.frame` at an early stage! – Matt Weller Dec 20 '13 at 14:13
  • 1
    `data.table` is even more succinct if you set a key first: `setkey(dt, "b"); lapply(unique(dt$b), function(x) dt[x])` – A5C1D2H2I1M1N2O1R2T1 Dec 20 '13 at 15:35