I have a data frame where all the variables are of character type. Many of the columns are completely empty, i.e. only the variable headers are there, but no values. Is there any way to subset out the empty columns?
Btw, by empty, do you mean `NA` or `""`? – asb Jul 16 '13 at 09:33
9 Answers
If your empty columns are really empty character columns, something like the following should work. It will need to be modified if your "empty" character columns include, say, spaces.
Sample data:
mydf <- data.frame(
  A = c("a", "b"),
  B = c("y", ""),
  C = c("", ""),
  D = c("", ""),
  E = c("", "z")
)
mydf
#   A B C D E
# 1 a y
# 2 b       z
Identifying and removing the "empty" columns.
mydf[!sapply(mydf, function(x) all(x == ""))]
#   A B E
# 1 a y
# 2 b   z
Alternatively, as recommended by @Roland:
mydf[, colSums(mydf != "") != 0]
#   A B E
# 1 a y
# 2 b   z
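Following up on the caveat about spaces above: if the "empty" cells might contain stray whitespace rather than true empty strings, a minimal sketch of a whitespace-tolerant variant (using base R's `trimws()`) would be:
# Sketch: drop columns whose values are empty after trimming whitespace
mydf[!sapply(mydf, function(x) all(trimws(x) == ""))]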

I just recommend extending the condition in case there are `NA` values instead of `""`: `all(x == "" | is.na(x))`. But it will lose a bit of the elegance though :) – Tomas Jul 16 '13 at 09:36
@Tomas, my *guess* here was that some data got read in incorrectly, leading to empty character columns, and `NA` is different than a read error, but good suggestion anyway. – A5C1D2H2I1M1N2O1R2T1 Jul 16 '13 at 09:45
Hi Ananda... Actually, the reading-in is correct. The external data is in the form of many blocks appended together. I have to parse out the individual blocks and throw out the irrelevant columns. – user702432 Jul 16 '13 at 09:49
@user702432, Cool. Glad it worked out for you. In the future, please do try to make a *reproducible example*, as I have done in this answer. – A5C1D2H2I1M1N2O1R2T1 Jul 16 '13 at 09:52
If you're talking about columns where all values are `NA`, use `remove_empty("cols")` from the janitor package.
If you have character vectors where every value is the empty string `""`, you can first convert those values to `NA` throughout your data.frame with `na_if` from the dplyr package:
dat <- data.frame(
  x = c("a", "b", "c"),
  y = c("", "", ""),
  z = c(NA, NA, NA),
  stringsAsFactors = FALSE
)
dat
#>   x y  z
#> 1 a   NA
#> 2 b   NA
#> 3 c   NA
library(dplyr)
library(janitor)
dat %>%
  mutate_all(funs(na_if(., ""))) %>%
  remove_empty("cols")
#>   x
#> 1 a
#> 2 b
#> 3 c
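Note that `funs()` has since been deprecated in dplyr; a minimal sketch of the same pipeline using the newer `across()` helper (assuming dplyr 1.0 or later), restricted to character columns so other column types are left untouched:
dat %>%
  mutate(across(where(is.character), ~ na_if(., ""))) %>%
  remove_empty("cols")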

Neat solution! But deprecated. Package says: This function is deprecated, use `remove_empty("cols")` instead. – radek Aug 07 '18 at 06:07
You can do either of the following:
emptycols <- sapply(df, function (k) all(is.na(k)))
df <- df[!emptycols]
or:
emptycols <- colSums(is.na(df)) == nrow(df)
df <- df[!emptycols]
If by empty you mean they are `""`, the second approach can be adapted like so:
emptycols <- colSums(df == "") == nrow(df)
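If a column can contain a mix of `NA` and `""`, the two checks can be combined (a sketch, not part of the original answer):
emptycols <- colSums(is.na(df) | df == "") == nrow(df)
df <- df[!emptycols]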

This is throwing an error. I think is.na() works only on numeric variables. – user702432 Jul 16 '13 at 09:29
user702432, the error thrown is not because of what you're saying; `is.na` will work just fine on character vectors as well. It's because he's passing an anonymous function to another. @asb, I think you can't do that; you'll have to do `sapply(df, function(x) all(is.na(x)))`. – Arun Jul 16 '13 at 09:30
I have a similar situation -- I'm working with a large public records database but when I whittle it down to just the date range and category that I need, there are a ton of columns that aren't in use. Some are blank and some are NA.
The selected answer (https://stackoverflow.com/a/17672737/233467) didn't work for me, but this did:
df[!sapply(df, function (x) all(is.na(x) | x == ""))]
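For example, with a small made-up data frame (the names here are just for illustration), columns that are all blank and columns that are all `NA` are both dropped:
df <- data.frame(id = c("1", "2"), blank = c("", ""), missing = c(NA, NA),
                 stringsAsFactors = FALSE)
df[!sapply(df, function(x) all(is.na(x) | x == ""))]
#   id
# 1  1
# 2  2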
It depends what you mean by empty: is it `NA` or `""`, or can it even be `" "`? Something like this might work:
df[, !apply(df, 2, function(x) all(gsub(" ", "", x) == "", na.rm = TRUE))]
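An equivalent sketch using `trimws()` in place of `gsub()` (same idea, not from the original answer):
df[, !apply(df, 2, function(x) all(trimws(x) == "", na.rm = TRUE))]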

This can also be done with dplyr and `select_if`: `select_if(df, function(x) { any(!is.na(x)) })`, or with `is.null()` or `x == ""` depending on how empty values are defined in your data.
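A combined check for both `NA` and `""` with `select_if` might look like this (a sketch; adjust the predicate to match how empty values appear in your data):
library(dplyr)
select_if(df, function(x) any(!is.na(x) & x != ""))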

Here is something that can be modified to exclude columns containing any of the values specified.
newdf <- df[, apply(df, 2, function(x) !any(is.na(x) | x == "" | x == "-4"))]
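Note that this keeps only columns containing none of the listed values, so a column with even a single blank or "-4" is dropped, which is different from dropping only fully empty columns. A small made-up illustration:
df <- data.frame(a = c("1", "2"), b = c("3", ""), stringsAsFactors = FALSE)
# Column b is dropped because it contains one ""
df[, apply(df, 2, function(x) !any(is.na(x) | x == "" | x == "-4")), drop = FALSE]
#   a
# 1 1
# 2 2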

A simple solution using the purrr package:
purrr::discard(my_data_frame, ~all(is.na(.)))
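If empties can also appear as `""`, the predicate can be widened (a sketch, not from the original answer):
purrr::discard(my_data_frame, ~ all(is.na(.) | . == ""))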

If you know the column indices, you can use
df[,-c(3, 5, 7)]
This will omit columns 3, 5, 7.
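If the indices are not known up front, they can be computed first, e.g. (a sketch assuming empty values are stored as `""` or `NA`):
empty_idx <- which(sapply(df, function(x) all(is.na(x) | x == "")))
# Guard against the case where no columns are empty: -integer(0) would drop everything
if (length(empty_idx) > 0) df <- df[, -empty_idx, drop = FALSE]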

Unfortunately I don't. There are 400+ variables, and the empty columns are all randomly placed. – user702432 Jul 16 '13 at 09:24