
This seems like a simple question, but I have not come across a clean solution for it yet. I have a vector in R and I want to remove certain elements from it, but I want to avoid the vector[vector != "thiselement"] notation for a variety of reasons. In particular, here is what I am trying to do:

# this doesn't work
all_states = gsub(" ", "-", tolower(state.name)) %>% filter("alaska")

# this doesn't work either
all_states = gsub(" ", "-", tolower(state.name)) %>% filter(!= "alaska")

# this does work, but I want to avoid this approach to filtering
all_states = gsub(" ", "-", tolower(state.name))
all_states = all_states[all_states != "alaska"]

Can this be done in a simple manner? Thanks in advance for the help!

EDIT - the reason I'm struggling with this is that I'm only finding things online about filtering based on a column of a data frame, for example:

my_df %>% filter(col != "alaska")

However, I'm working with a vector here, not a data frame.

Canovice
  • i just want to become more comfortable using dplyr to write cleaner code. I can technically do this with a 1-liner but it would have to be: all_states = gsub(" ", "-", tolower(state.name))[gsub(" ", "-", tolower(state.name)) != "alaska"] – Canovice May 24 '17 at 22:09
  • the list is gonna be expanded to include other states, and your solution doesn't account for the formatting to the state names i'm doing either – Canovice May 24 '17 at 22:15
  • The `d` in `dplyr` is for `data.frame`. "using dplyr to write cleaner code" should mean using `dplyr` for what it's made for (data frames) and not trying to use it when inappropriate (not data frames). – Gregor Thomas May 24 '17 at 22:29

4 Answers


Update

As @r_31415 noted in the comments, packages such as stringr provide functions that can better address this question.

With str_subset(string, pattern, negate=FALSE), one could filter character vectors like

library(stringr)

# Strings that have at least one character that is neither "A" nor "B".
> c("AB", "BA", "ab", "CA") %>% str_subset("[^AB]")
[1] "ab" "CA"


# Strings that do not include characters "A" or "B".
> c("AB", "BA", "ab", "CA") %>% str_subset("[AB]", negate=TRUE)
[1] "ab"

By default, the pattern is interpreted as a regular expression. Therefore, to match a literal pattern that contains special characters such as (, *, and ?, one can wrap the pattern string in the modifier function fixed(literal_string) instead of escaping with double backslashes or, since R 4.0.0, using a raw string.

# escape special character with "\\" (has to escape `\` with itself in a string literal).
> c("(123.5)", "12345") %>% str_subset("\\(123\\.5\\)")
[1] "(123.5)"

# R 4.0.0 supports raw-string, which is handy for regex strings
> c("(123.5)", "12345") %>% str_subset(r"{\(123\.5\)}")
[1] "(123.5)"

# use the fixed() modifier
> c("(123.5)", "12345") %>% str_subset(fixed("(123.5)"))
[1] "(123.5)"


## unexpected results if without escaping or the "fixed()" modifier
> c("(123.5)", "12345") %>% str_subset("(123.5)")
[1] "(123.5)" "12345"
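
Applied to the state names from the question, the same approach might look like the sketch below. The `drop` vector and the anchored pattern are illustrative assumptions, not part of the original answer. Because str_subset matches substrings by default, the pattern is anchored with ^ and $ so that only whole names are removed.

```r
library(magrittr)  # provides the %>% pipe
library(stringr)   # provides str_subset()

# Hypothetical list of states to drop; extend as needed.
drop <- c("alaska", "hawaii")

# Build a single anchored regex, "^(alaska|hawaii)$", so only whole names match.
pattern <- paste0("^(", paste(drop, collapse = "|"), ")$")

# Format the names and remove the unwanted ones in one pipeline.
all_states <- gsub(" ", "-", tolower(state.name)) %>%
  str_subset(pattern, negate = TRUE)
```

Without the anchors, a pattern such as "virginia" would also remove "west-virginia", which is probably not what is wanted here.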

Original Answer

Sorry for posting on a 5-month-old question to archive a simpler solution.

Package dplyr can filter character vectors in the following ways:

> c("A", "B", "C", "D") %>% .[matches("[^AB]", vars=.)]
[1] "C" "D"
> c("A", "B", "C", "D") %>% .[.!="A"]
[1] "B" "C" "D"

The first approach lets you filter with a regular expression, and the second is shorter. Both work because dplyr re-exports the pipe operator %>% from magrittr, and the placeholder . is a feature of the pipe itself, so it is available even though magrittr helper functions such as extract are not attached by dplyr.

Details of the placeholder . can be found in the help page of the forward-pipe operator %>%. The placeholder has three main uses:

  • Using the dot for secondary purposes
  • Using lambda expressions with %>%
  • Using the dot-place holder as lhs

Here we are taking advantage of its 3rd usage.
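
As a minimal sketch of what the pipe does with the dot in the second approach (the variable names `x`, `piped`, and `plain` are only for illustration):

```r
library(magrittr)

x <- c("A", "B", "C", "D")

# The dot is a direct argument of the right-hand side call, so the pipe
# substitutes the left-hand side for every dot instead of inserting it
# as an extra first argument.
piped <- x %>% .[. != "A"]

# .[. != "A"] is the call `[`(., . != "A"), so the line above is equivalent to:
plain <- x[x != "A"]

identical(piped, plain)
```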

Quar
  • Is there a way to negate the first approach with "matches" ? – Jens Dec 09 '20 at 16:52
  • @Jens, one could negate via either indexing, such as `c("A", "B", "C", "D") %>% .[-matches("[^AB]", vars=.)] `, or regular expression itself, such as `c("A", "B", "C", "D") %>% .[matches("[AB]", vars=.)]` -- perhaps the caveat here is to prepend `-` instead of `!` to the `matches` selected indices for the negation, because `matches` returns an integer vector `c(3, 4)`, rather than a boolean mask `c(F, F, T, T)`. – Quar Dec 09 '20 at 18:49
  • Many thanks! I tried ! and it did not work. Now I know why. -matches worked – Jens Dec 09 '20 at 21:56
  • `c("A", "B", "C", "D") %>% str_subset("[^AB]")` ? – r_31415 Nov 19 '22 at 23:45
  • @r_31415 Indeed and thanks for the note! package `stringr` provides `str_subset("[AB]", negate=T)`, along with other useful string-related functions. It is a neat and modern approach to work with strings. Would you like to add it to the answer? – Quar Nov 20 '22 at 21:09
  • Maybe you can include it as an additional option in your own answer. Since it is already marked as the correct answer, more people will notice it in that way. – r_31415 Nov 21 '22 at 02:00
  • @r_31415 I have made an attempt, let me know if anything was missing :D – Quar Nov 21 '22 at 19:26
  • @Quar Very good and very comprehensive! :) – r_31415 Nov 21 '22 at 21:06

You may like to try magrittr::extract. e.g.

> library(magrittr)

> c("A", "B", "C", "D") %>% extract(.!="A")
[1] "B" "C" "D"

For more extract-like functions, load the magrittr package and see the help page for extract (?extract), which lists the other aliases.
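
For the state names in the question, a sketch along the same lines could combine extract with is_in, magrittr's alias for `%in%`. The `drop` vector is a hypothetical list of states to remove and is not part of the original answer.

```r
library(magrittr)

# Hypothetical list of states to drop; extend as needed.
drop <- c("alaska", "hawaii")

# extract() is an alias for `[` and is_in() is an alias for `%in%`.
# The dot is nested inside is_in(), so the pipe still passes the
# left-hand side as the first argument of extract().
all_states <- gsub(" ", "-", tolower(state.name)) %>%
  extract(!is_in(., drop))
```

Because is_in compares whole elements, no pattern escaping or anchoring is needed here.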

Łukasz Deryło

Pretty sure dplyr only really operates on data.frames. Here's a two line example coercing the vector to a data.frame and back.

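library(dplyr)
# stringsAsFactors = FALSE keeps the column as character on R versions before 4.0.0
myDf = data.frame(states = gsub(" ", "-", tolower(state.name)), stringsAsFactors = FALSE) %>% filter(states != "alaska")
all_states = myDf$states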

or a gross one liner:

all_states = (data.frame(states = gsub(" ", "-", tolower(state.name))) %>% filter(states != "alaska"))$states
David Pedack
  • got it. yeah maybe im making my life harder than it needs to be. okay thanks – Canovice May 24 '17 at 22:41
  • yeah, it'd be nice to have 1 tool to use. dplyr ends up looking a lot cleaner than the base R code in my opinion. unfortunately it always ends up a mess with vectors. – David Pedack May 25 '17 at 15:00

An easy way to get the desired result within the tidyverse is to put the vector into a tibble, filter it, and then pull the vector back out.

library(dplyr)  # provides tibble(), filter(), pull() and %>%
tibble(myvec = gsub(" ", "-", tolower(state.name))) %>% 
   filter(myvec != "alaska") %>% pull(myvec)

This gives the desired output: [1] "alabama" "arizona" "arkansas" "california" "colorado" ...