Similar to "Deleting column with zero values in R?", I have the following sample data:
a <- c(0,2,5,7,2,3,0,3)
b <- c(2,3,0,0,1,0,4,0)
c <- c(0,0,0,0,0,0,0,0)
d <- c(2,5,1,2,3,4,5,6)
df <- data.frame(a,b,c,d)
but I only want a data.frame with a, b and d as columns, since c contains nothing but zeroes.
One option using dplyr could be:
df %>%
  select(where(~ any(. != 0)))
  a b d
1 0 2 2
2 2 3 5
3 5 0 1
4 7 0 2
5 2 1 3
6 3 0 4
7 0 4 5
8 3 0 6
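If the data can contain missing values, any(. != 0) evaluates to NA for a column made up of only zeroes and NAs. A small variant of the selection above (my sketch, assuming dplyr >= 1.0.0's where()) adds na.rm = TRUE so the predicate stays strictly TRUE/FALSE:

df %>%
  select(where(~ any(. != 0, na.rm = TRUE)))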
For a base R option, you could use colSums:
df[, colSums(df) != 0]
a b d
1 0 2 2
2 2 3 5
3 5 0 1
4 7 0 2
5 2 1 3
6 3 0 4
7 0 4 5
8 3 0 6
The expression colSums(df) != 0 is a logical vector that is TRUE only for the columns which are not all zeroes. Note that this answer assumes the columns contain only non-negative values; with negative values, a column could sum to zero without being all zeroes.
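To illustrate that caveat (a hypothetical data.frame, not from the question): values that cancel out make a column sum to zero, so it would be dropped even though it is not all zeroes:

df2 <- data.frame(x = c(-1, 1), y = c(2, 3))
colSums(df2) != 0
#>     x     y
#> FALSE  TRUE
df2[, colSums(df2) != 0, drop = FALSE]  # drops x, even though x is not all zeroes
#>   y
#> 1 2
#> 2 3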
One way to strictly find the columns which are not all zeroes is to assert that either the minimum or the maximum value of the column is non-zero:
colMax <- sapply(df, max, na.rm=TRUE)
colMin <- sapply(df, min, na.rm=TRUE)
df[, colMin != 0 | colMax != 0]
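A simpler alternative that also handles negative values (my suggestion, not part of the answer above) is to count the non-zero entries per column: df != 0 is a logical matrix, and its column sums are exactly those counts.

df[, colSums(df != 0) > 0]  # keeps a, b, d; safe with negative values too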
Using base R only, you can use apply(df, 2, function(x) all(x == 0)) to identify the columns that contain only zero values. Assigning NULL to these columns deletes them:
a <- c(0,2,5,7,2,3,0,3)
b <- c(2,3,0,0,1,0,4,0)
c <- c(0,0,0,0,0,0,0,0)
d <- c(2,5,1,2,3,4,5,6)
df <- data.frame(a,b,c,d)
df[apply(df, 2, function(x) all(x == 0))] <- NULL
df
#> a b d
#> 1 0 2 2
#> 2 2 3 5
#> 3 5 0 1
#> 4 7 0 2
#> 5 2 1 3
#> 6 3 0 4
#> 7 0 4 5
#> 8 3 0 6
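Yet another base R route (my addition, not from the answer above) is Filter(), which keeps the columns for which the predicate returns TRUE and returns a data.frame:

Filter(function(x) !all(x == 0), df)  # same a, b, d data.frame as above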
If you are interested in speed (and not necessarily in code readability, which can be debated):
library(dplyr)

dplyr_version <- function(d) {
  d %>%
    select(where(~ any(. != 0)))
}

base_version <- function(d) {
  d[apply(d, 2, function(x) all(x == 0))] <- NULL
  d
}

colsum_version <- function(d) {
  d[, colSums(d) != 0]
}

bench::mark(
  dplyr_version(df),
  base_version(df),
  colsum_version(df)
)
#> # A tibble: 3 x 13
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result
#> <bch:expr> <bch> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list>
#> 1 dplyr_version(df) 883µs 928.5µs 1057. 1.07MB 24.3 478 11 452ms <df[,…
#> 2 base_version(df) 70µs 77.6µs 11860. 480B 26.6 5344 12 451ms <df[,…
#> 3 colsum_version(df) 41.2µs 45.1µs 21580. 240B 15.1 9993 7 463ms <df[,…
#> # … with 3 more variables: memory <list>, time <list>, gc <list>
And testing for a larger dataset:
# Testing with a larger dataset
set.seed(251)
large_df <- df %>% sample_n(1e7, replace = TRUE)

bench::mark(
  dplyr_version(large_df),
  base_version(large_df),
  colsum_version(large_df)
)
#> # A tibble: 3 x 13
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
#> <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#> 1 dplyr_version(large_df) 77.5ms 85.6ms 12.0 114MB 12.0 3 3 250ms
#> 2 base_version(large_df) 65.8µs 69.5µs 14067. 480B 12.6 6720 6 478ms
#> 3 colsum_version(large_df) 121.6ms 122.1ms 8.19 229MB 8.19 2 2
#> # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
We see that, in this case, the base version comes out faster on the larger dataset.
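One caveat for the bracket-subsetting variants such as colsum_version (my note, not part of the benchmarks above): if only a single column survives, [ simplifies the result to a vector unless you pass drop = FALSE.

df_one <- data.frame(a = c(0, 0), d = c(1, 2))
df_one[, colSums(df_one) != 0]                # a numeric vector: 1 2
df_one[, colSums(df_one) != 0, drop = FALSE]  # still a data.frame with column d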