27

I have a data frame where all the variables are of character type. Many of the columns are completely empty, i.e. only the variable headers are there, but no values. Is there any way to subset out the empty columns?

ZygD
user702432

9 Answers

29

If your empty columns are really empty character columns, something like the following should work. It will need to be modified if your "empty" character columns include, say, spaces.

Sample data:

mydf <- data.frame(
  A = c("a", "b"),
  B = c("y", ""),
  C = c("", ""),
  D = c("", ""),
  E = c("", "z")
)
mydf
#   A B C D E
# 1 a y      
# 2 b       z

Identifying and removing the "empty" columns.

mydf[!sapply(mydf, function(x) all(x == ""))]
#   A B E
# 1 a y  
# 2 b   z

Alternatively, as recommended by @Roland:

> mydf[, colSums(mydf != "") != 0]
  A B E
1 a y  
2 b   z
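
If "empty" should also cover cells that contain only spaces or NA values, a hedged variation on the same idea (not part of the original answer; is_blank is just an illustrative helper) would be:

is_blank <- function(x) is.na(x) | trimws(x) == ""   # NA or whitespace-only counts as empty
mydf[!sapply(mydf, function(x) all(is_blank(x)))]
#   A B E
# 1 a y  
# 2 b   z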
A5C1D2H2I1M1N2O1R2T1
  • I just recommend extending the condition in case there are NA values instead of "": `all(x == "" | is.na(x))`. But it will lose a bit of the elegance, though :) – Tomas Jul 16 '13 at 09:36
  • `mydf[,colSums(mydf!="")!=0]` might be faster. – Roland Jul 16 '13 at 09:43
  • @Tomas, my *guess* here was that some data got read in incorrectly, leading to empty character columns, and `NA` is different than a read error, but good suggestion anyway. – A5C1D2H2I1M1N2O1R2T1 Jul 16 '13 at 09:45
  • Hi Ananda... Actually, the reading-in is correct. The external data is in the form of many blocks appended together. I have to parse out the individual blocks and throw out the irrelevant columns. – user702432 Jul 16 '13 at 09:49
  • @AnandaMahto Maybe `mydf[,colSums(mydf!="")>0]` is more to your taste? – Roland Jul 16 '13 at 09:51
  • @user702432, Cool. Glad it worked out for you. In the future, please do try to make a *reproducible example*, as I have done in this answer. – A5C1D2H2I1M1N2O1R2T1 Jul 16 '13 at 09:52
14

If you're talking about columns where all values are NA, use remove_empty("cols") from the janitor package.

If you have character vectors where every value is the empty string "", you can first convert those values to NA throughout your data.frame with na_if from the dplyr package:

dat <- data.frame(
  x = c("a", "b", "c"),
  y = c("", "", ""),
  z = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

dat
#>   x y  z
#> 1 a   NA
#> 2 b   NA
#> 3 c   NA

library(dplyr)
library(janitor)

dat %>%
  mutate_all(funs(na_if(., ""))) %>%
  remove_empty("cols")
#>   x
#> 1 a
#> 2 b
#> 3 c
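
funs() has since been deprecated in dplyr. Assuming dplyr >= 1.0, a sketch of the same pipeline using across() (restricted to character columns, since na_if() now requires the replacement value to match the column type) might be:

dat %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
  remove_empty("cols")
#>   x
#> 1 a
#> 2 b
#> 3 c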
Sam Firke
13

You can do either of the following:

emptycols <- sapply(df, function (k) all(is.na(k)))
df <- df[!emptycols]

or:

emptycols <- colSums(is.na(df)) == nrow(df)
df <- df[!emptycols]

If by empty you mean they are "", the second approach can be adapted like so:

emptycols <- colSums(df == "") == nrow(df)
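
If a column should count as empty when every value is either NA or "", a hedged combination of the two checks (assuming the comparison df == "" is valid for your columns) could be:

emptycols <- colSums(is.na(df) | df == "") == nrow(df)  # TRUE where every value is NA or ""
df <- df[!emptycols]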
asb
  • This is throwing an error. I think is.na() works only on numeric variables. – user702432 Jul 16 '13 at 09:29
  • user702432, the error thrown is not because of what you're saying; `is.na` will work just fine on character vectors as well. It's because he's passing an anonymous function to another. @asb, I think you can't do that; you'll have to do `sapply(df, function(x) all(is.na(x)))`. – Arun Jul 16 '13 at 09:30
  • Oversight. Just fixed it. – asb Jul 16 '13 at 09:31
6

I have a similar situation -- I'm working with a large public records database but when I whittle it down to just the date range and category that I need, there are a ton of columns that aren't in use. Some are blank and some are NA.

The selected answer (https://stackoverflow.com/a/17672737/233467) didn't work for me, but this did:

df[!sapply(df, function (x) all(is.na(x) | x == ""))]
Amanda
4

It depends on what you mean by empty: is it NA or "", or can it even be " "? Something like this might work:

df[, !apply(df, 2, function(x) all(gsub(" ", "", x) == "", na.rm = TRUE))]
4

This can also be done with dplyr and select_if:

select_if(df, function(x) { any(!is.na(x)) })

or with is.null() or x == "", depending on how empty values are defined in your data.
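
For example, if empty means the empty string "" rather than NA, a hedged variant along the same lines (na.rm = TRUE means all-NA columns are dropped too) might be:

select_if(df, function(x) { !all(x == "", na.rm = TRUE) })

In current dplyr, select_if() is superseded by select() with where(), e.g. select(df, where(~ any(!is.na(.x)))).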

Gauti
2

Here is something that can be modified to exclude columns containing any of the values specified.

newdf <- df[, apply(df, 2, function(x) !any(is.na(x) | x == "" | x == "-4"))]
Eddy Zavala
2

A simple solution using the purrr package:

purrr::discard(my_data_frame, ~all(is.na(.)))

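The same idea extends to empty strings; a hedged variant that treats both NA and "" as empty might be:

purrr::discard(my_data_frame, ~ all(is.na(.x) | .x == ""))
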
Lior Polat
-1

If you know the column indices, you can use

df[,-c(3, 5, 7)]

This will omit columns 3, 5, 7.

Tomas