
I have the following data frame:

dat <- structure(list(`A-XXX` = c(1.51653275922944, 0.077037240321129, 
0), `fBM-XXX` = c(2.22875185527511, 0, 0), `P-XXX` = c(1.73356698481106, 
0, 0), `vBM-XXX` = c(3.00397859609183, 0, 0)), .Names = c("A-XXX", 
"fBM-XXX", "P-XXX", "vBM-XXX"), row.names = c("BATF::JUN_AHR", 
"BATF::JUN_CCR9", "BATF::JUN_IL10"), class = "data.frame")

dat 
#>                     A-XXX  fBM-XXX    P-XXX  vBM-XXX
#> BATF::JUN_AHR  1.51653276 2.228752 1.733567 3.003979
#> BATF::JUN_CCR9 0.07703724 0.000000 0.000000 0.000000
#> BATF::JUN_IL10 0.00000000 0.000000 0.000000 0.000000

I can remove the rows where all columns are zero with this command:

> dat <- dat[ rowSums(dat)!=0, ]
> dat
                    A-XXX  fBM-XXX    P-XXX  vBM-XXX
BATF::JUN_AHR  1.51653276 2.228752 1.733567 3.003979
BATF::JUN_CCR9 0.07703724 0.000000 0.000000 0.000000

But how can I do it with dplyr's pipe style?

littleworth

6 Answers


Here's a dplyr option:

library(dplyr)
filter_all(dat, any_vars(. != 0))

#       A-XXX  fBM-XXX    P-XXX  vBM-XXX
#1 1.51653276 2.228752 1.733567 3.003979
#2 0.07703724 0.000000 0.000000 0.000000

Here we use the logic that a row is kept if any variable is not equal to zero, which is the same as removing rows where all variables are equal to zero.
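For comparison, the same "keep rows where any value is non-zero" logic in base R would look like this (a minimal sketch using the question's dat):

dat[apply(dat != 0, 1, any), ]
#                     A-XXX  fBM-XXX    P-XXX  vBM-XXX
# BATF::JUN_AHR  1.51653276 2.228752 1.733567 3.003979
# BATF::JUN_CCR9 0.07703724 0.000000 0.000000 0.000000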

Regarding row.names:

library(tidyverse)
dat %>% rownames_to_column() %>% filter_at(vars(-rowname), any_vars(. != 0))
#         rowname      A-XXX  fBM-XXX    P-XXX  vBM-XXX
#1  BATF::JUN_AHR 1.51653276 2.228752 1.733567 3.003979
#2 BATF::JUN_CCR9 0.07703724 0.000000 0.000000 0.000000
talat
  • Need to preserve the row names. How can I modify your code? – littleworth Mar 15 '18 at 06:48
    @scamander, see my update. In general, you don't want to work with row.names when using dplyr (or data.table). Instead, always store that information in a separate column. – talat Mar 15 '18 at 06:52
  • how to do that with dplyr and across() rather than _all? – Art Jan 07 '21 at 09:45
    @Art, you can use `if_any`, I found an example in the `across()` help page. I tailored this example and posted a new answer. – hnagaty Apr 12 '21 at 05:51

Update 2022-11-11

With current versions of dplyr, using across() inside filter() is deprecated in favour of if_any()/if_all(). Since we want to drop only the rows in which every column is zero, the updated solution is:

data %>% filter(if_any(everything(), ~ . != 0))
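Applied to the question's dat, this gives the expected result (note that filter() drops the data.frame row names, as discussed in the other answers):

library(dplyr)
dat %>% filter(if_any(everything(), ~ . != 0))
#       A-XXX  fBM-XXX    P-XXX  vBM-XXX
#1 1.51653276 2.228752 1.733567 3.003979
#2 0.07703724 0.000000 0.000000 0.000000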

Old solution (deprecated)

Adding to the answer by @mgrund, a shorter alternative with dplyr 1.0.0 is:

# Option A:
data %>% filter(across(everything()) != 0)

# Option B:
data %>% filter(across(everything(), ~ . != 0))

Explanation:
across() applies the check to every column selected by the tidyselect helper, here everything(), i.e. all columns. In Option A, the data frame returned by across() is compared with 0 as a whole, and filter() keeps the rows in which every resulting value is TRUE. In Option B, the formula (~ . != 0) is applied to each column, which amounts to the same row-wise check. Note that both options therefore keep only rows in which every column is non-zero; to drop just the rows in which all columns are zero, use if_any() as in the updated solution above.

EDIT:
As filter() already evaluates its condition for every row, you don't need rowwise() here. This is different for select or mutate.

IMPORTANT:
In Option A, it is crucial to write across(everything()) != 0,
and NOT across(everything() != 0)!

Reason:
across() expects a tidyselect specification (here everything()), not a logical comparison (which is what everything() != 0 would be).
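To see the practical difference on the question's dat: if_all() (the equivalent of the deprecated options above) keeps only the rows in which every column is non-zero, so it also drops BATF::JUN_CCR9 (a quick sketch, assuming dat as defined in the question):

dat %>% filter(if_all(everything(), ~ . != 0))
#      A-XXX  fBM-XXX    P-XXX  vBM-XXX
# 1 1.516533 2.228752 1.733567 3.003979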

Agile Bean
  • Error: Problem with `filter()` input `..1`. x Must subset columns with a valid subscript vector. x Subscript has the wrong type `logical`. i It must be numeric or character. i Input `..1` is `across(everything() != 0)`. i The error occurred in row 1. – Art Jan 07 '21 at 09:43
    You are right, my version only works without `rowwise`. Corrected! – Agile Bean Jan 08 '21 at 05:25

You can use the new if_any(). I tailored an example from the documentation of if_any():

library(dplyr)
library(tibble)
dat <- structure(list(`A-XXX` = c(1.51653275922944, 0.077037240321129, 0),
                      `fBM-XXX` = c(2.22875185527511, 0, 0),
                      `P-XXX` = c(1.73356698481106, 0, 0),
                      `vBM-XXX` = c(3.00397859609183, 0, 0)),
                 .Names = c("A-XXX", "fBM-XXX", "P-XXX", "vBM-XXX"),
                 row.names = c("BATF::JUN_AHR", "BATF::JUN_CCR9", "BATF::JUN_IL10"),
                 class = "data.frame")
dat
#>                     A-XXX  fBM-XXX    P-XXX  vBM-XXX
#> BATF::JUN_AHR  1.51653276 2.228752 1.733567 3.003979
#> BATF::JUN_CCR9 0.07703724 0.000000 0.000000 0.000000
#> BATF::JUN_IL10 0.00000000 0.000000 0.000000 0.000000

dat %>% 
  rownames_to_column("ID") %>% 
  filter(if_any(!matches("ID"), ~ . != 0)) %>% 
  column_to_rownames("ID")
#>                     A-XXX  fBM-XXX    P-XXX  vBM-XXX
#> BATF::JUN_AHR  1.51653276 2.228752 1.733567 3.003979
#> BATF::JUN_CCR9 0.07703724 0.000000 0.000000 0.000000

Created on 2021-04-12 by the reprex package (v1.0.0)
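A slightly shorter selection for the same idea (my variation, not part of the original example) is to exclude the ID column directly:

dat %>% 
  rownames_to_column("ID") %>% 
  filter(if_any(-ID, ~ . != 0)) %>% 
  column_to_rownames("ID")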

hnagaty

We could use reduce from purrr to get the row sums and then filter the dataset based on the resulting logical vector:

library(tidyverse)
dat %>%
    reduce(`+`) %>%
    {. != 0} %>%
    filter(dat, .)
#       A-XXX  fBM-XXX    P-XXX  vBM-XXX
#1 1.51653276 2.228752 1.733567 3.003979
#2 0.07703724 0.000000 0.000000 0.000000

NOTE: Within the %>% chain, the row.names get stripped off. It may be better to create a new column or assign the row.names afterwards.


If we need the row names as well, create a row-names column early on and use it to restore the row names at the end:

dat %>%
  rownames_to_column('rn') %>%
  filter(rowSums(.[-1]) != 0) %>% 
  `row.names<-`(., .[['rn']]) %>% select(-rn)
#                   A-XXX  fBM-XXX    P-XXX  vBM-XXX
#BATF::JUN_AHR  1.51653276 2.228752 1.733567 3.003979
#BATF::JUN_CCR9 0.07703724 0.000000 0.000000 0.000000
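The same result can also be written with tibble's column_to_rownames() instead of the `row.names<-` replacement function (a variant of the above, not from the original answer):

dat %>%
  rownames_to_column('rn') %>%
  filter(rowSums(.[-1]) != 0) %>%
  column_to_rownames('rn')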
akrun

Here is a third option that uses purrr::pmap to flag whether all values in a row are zero. It is definitely less compact than filter_at, but it opens up options for interesting and more complex conditions using pmap!

dat <- structure(list(`A-XXX` = c(1.51653275922944, 0.077037240321129, 0),
                      `fBM-XXX` = c(2.22875185527511, 0, 0),
                      `P-XXX` = c(1.73356698481106, 0, 0),
                      `vBM-XXX` = c(3.00397859609183, 0, 0)),
                 .Names = c("A-XXX", "fBM-XXX", "P-XXX", "vBM-XXX"),
                 row.names = c("BATF::JUN_AHR", "BATF::JUN_CCR9", "BATF::JUN_IL10"),
                 class = "data.frame")

library(tidyverse)
dat %>%
  rownames_to_column() %>%
  bind_cols(all_zero = pmap_lgl(., function(rowname, ...) all(list(...) == 0))) %>%
  filter(all_zero == FALSE) %>%
  `rownames<-`(.$rowname) %>%
  select(-rowname, -all_zero)
#>                     A-XXX  fBM-XXX    P-XXX  vBM-XXX
#> BATF::JUN_AHR  1.51653276 2.228752 1.733567 3.003979
#> BATF::JUN_CCR9 0.07703724 0.000000 0.000000 0.000000

Created on 2018-03-14 by the reprex package (v0.2.0).
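For example, the same pattern can express a more involved condition, such as keeping only the rows with at most one zero (my own illustration, not part of the original answer):

dat %>%
  rownames_to_column() %>%
  filter(pmap_lgl(select(., -rowname), ~ sum(c(...) == 0) <= 1)) %>%
  `rownames<-`(.$rowname) %>%
  select(-rowname)
#>                  A-XXX  fBM-XXX    P-XXX  vBM-XXX
#> BATF::JUN_AHR 1.516533 2.228752 1.733567 3.003979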

Calum You

Here's another option using the row-wise operations of dplyr (with col1, col2, col3 standing in for three example columns over which the row-wise sum is calculated):

library(tidyverse)

df <- df %>% 
    rowwise() %>% 
    filter(sum(c(col1,col2,col3)) != 0)

Alternatively, if you have many variables (columns) to select, you can also use the tidyverse selection syntax via:

df <- df %>% 
    rowwise() %>% 
    filter(sum(c_across(col1:col3)) != 0)

For details see: https://dplyr.tidyverse.org/articles/rowwise.html
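Applied to the question's dat, this would be (a sketch; the backticks are needed because of the dashes in the column names, and the sum != 0 test assumes the values cannot be negative):

library(dplyr)

dat %>% 
    rowwise() %>% 
    filter(sum(c_across(`A-XXX`:`vBM-XXX`)) != 0) %>% 
    ungroup()

As with the other dplyr approaches, the row names are dropped here, so convert them to a column first if you need to keep them.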

mgrund