
Similar to "Deleting column with zero values in R?"

Sample data:

a <- c(0,2,5,7,2,3,0,3)
b <- c(2,3,0,0,1,0,4,0)
c <- c(0,0,0,0,0,0,0,0)
d <- c(2,5,1,2,3,4,5,6)

df <- data.frame(a,b,c,d)

I want to remove the columns that contain only zeros, so that the resulting data.frame has only a, b, and d as columns.

AdIan

3 Answers

One option using dplyr could be:

df %>%
 select(where(~ any(. != 0)))

  a b d
1 0 2 2
2 2 3 5
3 5 0 1
4 7 0 2
5 2 1 3
6 3 0 4
7 0 4 5
8 3 0 6
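On dplyr versions older than 1.0.0, where `where()` is not yet available, `select_if()` should give the same result. A sketch using the question's sample data (not part of the original answer):

```r
library(dplyr)

df <- data.frame(a = c(0, 2, 5, 7, 2, 3, 0, 3),
                 b = c(2, 3, 0, 0, 1, 0, 4, 0),
                 c = c(0, 0, 0, 0, 0, 0, 0, 0),
                 d = c(2, 5, 1, 2, 3, 4, 5, 6))

# select_if() keeps columns for which the predicate returns TRUE,
# playing the role that select(where(...)) plays in dplyr >= 1.0.0
df %>% select_if(~ any(. != 0))
```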
tmfmnk
2

For a base R option, you could use colSums:

df[, colSums(df) != 0]

  a b d
1 0 2 2
2 2 3 5
3 5 0 1
4 7 0 2
5 2 1 3
6 3 0 4
7 0 4 5
8 3 0 6

The expression colSums(df) != 0 is a logical vector, which is TRUE only for those columns that do not sum to zero. Note that this answer assumes the columns contain only non-negative values; a column whose positive and negative values cancel out would also sum to zero and be dropped.
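To illustrate the caveat, here is a small sketch with a hypothetical column e whose values cancel out; a sign-agnostic variant counts non-zero cells per column instead of summing the values (this example is not from the original answer):

```r
# Hypothetical column "e": it sums to zero even though it has non-zero entries
df2 <- data.frame(a = c(0, 2), e = c(-1, 1))

# The sum-based filter wrongly drops "e"
df2[, colSums(df2) != 0, drop = FALSE]

# Sign-agnostic variant: df2 != 0 is a logical matrix, so colSums()
# counts the non-zero cells in each column; both columns are kept
df2[, colSums(df2 != 0) > 0, drop = FALSE]
```

The `drop = FALSE` guards against the result collapsing to a vector when only one column survives.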

One way to phrase an answer which strictly finds columns which do not have all zeroes would be to assert that either the min or max value of that column is not zero:

colMax <- sapply(df, max, na.rm=TRUE)
colMin <- sapply(df, min, na.rm=TRUE)
df[, colMin != 0 | colMax != 0]
Tim Biegeleisen
1

Using base R only, you can use apply(df, 2, function(x) all(x == 0)) to identify the columns that contain only zero values. Assigning NULL to these columns deletes them.

a <- c(0,2,5,7,2,3,0,3)
b <- c(2,3,0,0,1,0,4,0)
c <- c(0,0,0,0,0,0,0,0)
d <- c(2,5,1,2,3,4,5,6)

df <- data.frame(a,b,c,d)

df[apply(df, 2, function(x) all(x == 0))] <- NULL
df
#>   a b d
#> 1 0 2 2
#> 2 2 3 5
#> 3 5 0 1
#> 4 7 0 2
#> 5 2 1 3
#> 6 3 0 4
#> 7 0 4 5
#> 8 3 0 6

Quick Benchmark

If you are interested in speed (and not necessarily in readability, which can be debated):

library(dplyr)
dplyr_version <- function(d) {
  d %>%
    select(where(~ any(. != 0)))
}
base_version <- function(d) {
  d[apply(d, 2, function(x) all(x == 0))] <- NULL
  d
}
colsum_version <- function(d) {
  d[, colSums(d) != 0]
}


bench::mark(
  dplyr_version(df),
  base_version(df),
  colsum_version(df)
)
#> # A tibble: 3 x 13
#>   expression          min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
#>   <bch:expr>        <bch> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
#> 1 dplyr_version(df) 883µs 928.5µs     1057.    1.07MB     24.3   478    11      452ms <df[,…
#> 2 base_version(df)   70µs  77.6µs    11860.      480B     26.6  5344    12      451ms <df[,…
#> 3 colsum_version(df) 41.2µs  45.1µs    21580.      240B     15.1  9993     7      463ms <df[,…
#> # … with 3 more variables: memory <list>, time <list>, gc <list>

And testing for a larger dataset:

# Testing for a larger file
set.seed(251)
large_df <- df %>% sample_n(1e7, replace = TRUE)
bench::mark(
  dplyr_version(large_df),
  base_version(large_df),
  colsum_version(large_df)
)

#> # A tibble: 3 x 13
#>   expression                 min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr>              <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 dplyr_version(large_df) 77.5ms 85.6ms      12.0     114MB     12.0     3     3      250ms
#> 2 base_version(large_df)  65.8µs 69.5µs   14067.       480B     12.6  6720     6      478ms
#> 3 colsum_version(large_df) 121.6ms 122.1ms      8.19     229MB     8.19     2     2
#> # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

We see that, in this case, the base version is faster on the larger dataset.
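Not benchmarked above, but for completeness, base R's `Filter()` offers another compact spelling of the same idea; a sketch using the question's sample data:

```r
df <- data.frame(a = c(0, 2, 5, 7, 2, 3, 0, 3),
                 b = c(2, 3, 0, 0, 1, 0, 4, 0),
                 c = c(0, 0, 0, 0, 0, 0, 0, 0),
                 d = c(2, 5, 1, 2, 3, 4, 5, 6))

# Filter() keeps the elements (here: columns) for which the
# predicate returns TRUE, and returns a data.frame again
Filter(function(x) any(x != 0), df)
```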

David