For the second time in two weeks, I'm working with data that includes a ton of empty columns. It's public records data, and I'm only interested in one category. I suspect that other categories of the larger data set use these columns, but the subset I care about doesn't. So I filter out the records I don't want, and then I'd like to systematically cull the empty columns.
This question has a great method:
R: Remove multiple empty columns of character variables
empty_columns <- sapply(df, function (k) all(is.na(k) | k == ""))
df <- df[!empty_columns]
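To make the mechanics concrete, here's a minimal sketch on a toy frame (the frame and its column names are made up for illustration):

```r
# Toy frame: one real column, one all-NA column, one all-empty-string column
toy <- data.frame(x = 1:3, y = NA, z = "", stringsAsFactors = FALSE)

# sapply returns a named logical vector, TRUE for each empty column
empty_columns <- sapply(toy, function(k) all(is.na(k) | k == ""))
empty_columns      # x FALSE, y TRUE, z TRUE

# Logical column indexing keeps only the non-empty columns
toy <- toy[!empty_columns]
```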
But I'd like to make that a function, so I can run it using the name of the data frame exactly once. Something like:
drop_empty_cols <- function(df) {
  empty_columns <- sapply(df, function(k) all(is.na(k) | k == ""))
  df <- df[!empty_columns]
}
drop_empty_cols(my_frame)
But ... wrapped in a function, the method fails, and fails silently. Here's some sample data:
demo <- read.table(text="Real.Val All.NA Nothin.here
1 3.5 NA tmp
2 3.0 NA tmp
3 3.2 NA tmp
4 3.1 NA tmp
5 3.6 NA tmp
6 3.9 NA tmp" , header = TRUE)
demo$Nothin.here <- ""
(I'm sure there's a way to write a reproducible example with an empty column, but mine was choking. So this one empties the column after you create the frame.)
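For what it's worth, a self-contained version of the same frame can be sketched with data.frame() instead of read.table(); the single NA and "" are recycled down each column (the explicit stringsAsFactors = FALSE is my addition, to keep the empty column as plain character strings):

```r
# Sketch of the same frame built directly; NA and "" recycle to fill the columns
demo <- data.frame(
  Real.Val    = c(3.5, 3.0, 3.2, 3.1, 3.6, 3.9),
  All.NA      = NA,
  Nothin.here = "",
  stringsAsFactors = FALSE
)
```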
If I do drop_empty_cols(demo), I still have 6 obs. of 3 variables. If I do
empty_columns <- sapply(demo, function (k) all(is.na(k) | k == ""))
demo <- demo[!empty_columns]
I get the desired result: 6 obs. of 1 variable. But to reuse that, I have to type demo three times. Is it even possible to use a function to transform a data frame directly?
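For context, the closest pattern I'm aware of is to have the function return the pruned frame and reassign it at the call site; a minimal sketch (the reassignment line is the key assumption here, since R functions receive copies of their arguments and can't modify the caller's variable in place):

```r
drop_empty_cols <- function(df) {
  empty_columns <- sapply(df, function(k) all(is.na(k) | k == ""))
  df[!empty_columns]   # last expression is the (visible) return value
}

demo <- data.frame(Real.Val = c(3.5, 3.0), All.NA = NA, Nothin.here = "",
                   stringsAsFactors = FALSE)
demo <- drop_empty_cols(demo)   # without the `demo <-`, demo is unchanged
```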