dplyr filter on a vector rather than a dataframe in R

Question

This seems like a simple question, but I have not come across a clean solution for it yet. I have a vector in R and I want to remove certain elements from it, but I want to avoid the vector[vector != "thiselement"] notation for a variety of reasons. In particular, here is what I am trying to do:

# this doesn't work
all_states = gsub(" ", "-", tolower(state.name)) %>% filter("alaska")

# this doesn't work either
all_states = gsub(" ", "-", tolower(state.name)) %>% filter(!= "alaska")

# this does work, but I want to avoid this approach to filtering
all_states = gsub(" ", "-", tolower(state.name))
all_states = all_states[all_states != "alaska"]

Can this be done in a simple manner? Thanks in advance for the help!

EDIT - the reason I'm struggling with this is because I'm only finding things online regarding filtering based on a column of a dataframe, for example:

my_df %>% filter(col != "alaska")

However, I'm working with a vector here, not a dataframe.

i just want to become more comfortable using dplyr to write cleaner code. I can technically do this with a 1-liner but it would have to be: all_states = gsub(" ", "-", tolower(state.name))[gsub(" ", "-", tolower(state.name)) != "alaska"] — Canovice, May 24 '17 at 22:09
the list is gonna be expanded to include other states, and your solution doesn't account for the formatting to the state names i'm doing either — Canovice, May 24 '17 at 22:15
The `d` in `dplyr` is for `data.frame`. "using dplyr to write cleaner code" should mean using `dplyr` for what it's made for (data frames) and not trying to use it when inappropriate (not data frames). — Gregor Thomas, May 24 '17 at 22:29

Quar · Accepted Answer · 2022-11-21T19:23:04.687

Update

As @r_31415 noted in the comments, packages such as stringr provide functions that can better address this question.

With str_subset(string, pattern, negate=FALSE), one can filter character vectors like this:

library(stringr)

# Strings that have at least one character that is neither "A" nor "B".
> c("AB", "BA", "ab", "CA") %>% str_subset("[^AB]")
[1] "ab" "CA"


# Strings that do not include characters "A" or "B".
> c("AB", "BA", "ab", "CA") %>% str_subset("[AB]", negate=TRUE)
[1] "ab"

By default, the pattern is interpreted as a regular expression. To match a literal pattern that contains special characters such as (, *, and ?, wrap the pattern in the modifier function fixed(literal_string). The alternatives are to escape each special character with a double backslash or, since R 4.0.0, to write the regular expression as a raw string.

# Escape special characters with "\\" (the backslash itself must be escaped inside a string literal).
> c("(123.5)", "12345") %>% str_subset("\\(123\\.5\\)")
[1] "(123.5)"

# R 4.0.0 and later support raw strings, which are handy for regex patterns
> c("(123.5)", "12345") %>% str_subset(r"{\(123\.5\)}")
[1] "(123.5)"

# use the fixed() modifier
> c("(123.5)", "12345") %>% str_subset(fixed("(123.5)"))
[1] "(123.5)"


# Unexpected results without escaping or the fixed() modifier
> c("(123.5)", "12345") %>% str_subset("(123.5)")
[1] "(123.5)" "12345"

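Applied to the vector from the question, the same idea might look like the sketch below. It assumes magrittr is loaded for %>% and that an exact match is wanted, hence the ^ and $ anchors. The name drop_pattern is only illustrative and does not come from the question.

```r
library(magrittr)  # provides %>%
library(stringr)

# Illustrative pattern: the anchors make this an exact match, and more
# states can be added inside the group, e.g. "^(alaska|hawaii)$".
drop_pattern <- "^(alaska)$"

all_states <- state.name %>%
  tolower() %>%
  str_replace_all(" ", "-") %>%
  str_subset(drop_pattern, negate = TRUE)

length(all_states)        # 49
"alaska" %in% all_states  # FALSE
```

Without the anchors, str_subset would drop every string that merely contains the pattern, which may or may not be what is wanted once the list of states grows.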
Original Answer

Sorry for posting on a 5-month-old question to archive a simpler solution.

With dplyr loaded, character vectors can be filtered in the following ways:

> c("A", "B", "C", "D") %>% .[matches("[^AB]", vars=.)]
[1] "C" "D"
> c("A", "B", "C", "D") %>% .[.!="A"]
[1] "B" "C" "D"

The first approach lets you filter with a regular expression (note that matches() ignores case by default), and the second is shorter to type. Both work because dplyr re-exports the pipe %>% from magrittr, and the pipe supports the placeholder `.`. Other magrittr helpers, such as extract, are not re-exported by dplyr.

Details of the placeholder `.` can be found in the help page for the forward-pipe operator %>%. The placeholder has three main uses:

Using the dot for secondary purposes

Using lambda expressions with %>%

Using the dot-place holder as lhs

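Here we are taking advantage of the first usage: the dot stands for the left-hand side both as the object being indexed and, nested inside the call, in the comparison.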
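For the vector in the question, the placeholder idiom can also drop several values at once. A minimal sketch, assuming only magrittr's %>% (which dplyr re-exports); the vector to_drop is a made-up example, not something from the question:

```r
library(magrittr)  # dplyr re-exports the same %>%

# Hypothetical list of states to remove, already in the formatted style.
to_drop <- c("alaska", "hawaii")

all_states <- gsub(" ", "-", tolower(state.name)) %>%
  .[!(. %in% to_drop)]   # both dots refer to the piped-in vector

length(all_states)  # 48
```

Because the dot appears as a direct argument of `[`, the pipe does not insert the left-hand side a second time as the first argument.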

Is there a way to negate the first approach with "matches" ? — Jens, Dec 09 '20 at 16:52
@Jens, one could negate via either indexing, such as `c("A", "B", "C", "D") %>% .[-matches("[^AB]", vars=.)] `, or regular expression itself, such as `c("A", "B", "C", "D") %>% .[matches("[AB]", vars=.)]` -- perhaps the caveats here is to prepend `-` instead of `!` to the `matches` selected indices for the negation, because `matches` returns an integer vector `c(3, 4)`, rather than a boolean mask `c(F, F, T, T)`. — Quar, Dec 09 '20 at 18:49
Many thanks! I tried ! and it did not work. Now I know why. -matches worked — Jens, Dec 09 '20 at 21:56
@r_31415 Indeed and thanks for the note! package `stringr` provides `str_subset("[AB]", negate=T)`, along with other useful string-related functions. It is a neat and modern approach to work with strings. Would you like to add it to the answer? — Quar, Nov 20 '22 at 21:09
Maybe you can include it as an additional option in your own answer. Since it is already marked as the correct answer, more people will notice it in that way. — r_31415, Nov 21 '22 at 02:00
@r_31415 I have made an attempt, let me know if anything was missing :D — Quar, Nov 21 '22 at 19:26

score 21 · Answer 2 · answered May 25 '17 at 07:46

You may like to try magrittr::extract, e.g.:

> library(magrittr)

> c("A", "B", "C", "D") %>% extract(.!="A")
[1] "B" "C" "D"

For more extract-like functions load magrittr package and type ?alises.
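A sketch of the same call on the vector from the question. The magrittr:: prefix is optional; it is used here on the assumption that another attached package (tidyr, for instance) may also export a function named extract.

```r
library(magrittr)

# extract() is an alias for `[`, so the piped-in vector is indexed by the
# logical mask built from the dot.
all_states <- gsub(" ", "-", tolower(state.name)) %>%
  magrittr::extract(. != "alaska")

length(all_states)  # 49
```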

Łukasz Deryło

No documentation for ‘alises’ in specified packages and libraries: – Hielke Walinga Apr 02 '20 at 13:19
It must've been removed in the current version; `?extract` works now. – Łukasz Deryło Apr 03 '20 at 04:55
very unfortunate that `tidyr` extract means a totally different thing. Love pipes, and this vector function is awesome! – JelenaČuklina Aug 12 '20 at 10:11
This works not only with characters but also with numeric data, e.g. `vec %>% extract(.==123.45)`. So your suggestion works perfectly and is concise, too. – jaggedjava Aug 28 '23 at 20:14

score 4 · Answer 3 · answered May 24 '17 at 22:17

Pretty sure dplyr only really operates on data.frames. Here's a two line example coercing the vector to a data.frame and back.

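# stringsAsFactors = FALSE keeps the column a character vector on R < 4.0.0
myDf = data.frame(states = gsub(" ", "-", tolower(state.name)), stringsAsFactors = FALSE) %>% filter(states != "alaska")
all_states = myDf$states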

or a gross one liner:

all_states = (data.frame(states = gsub(" ", "-", tolower(state.name)), stringsAsFactors = FALSE) %>% filter(states != "alaska"))$states

David Pedack

got it. yeah maybe im making my life harder than it needs to be. okay thanks – Canovice May 24 '17 at 22:41
yeah, it'd be nice to have 1 tool to use. dplyr ends up looking a lot cleaner than the base R code in my opinion. unfortunately it always ends up a mess with vectors. – David Pedack May 25 '17 at 15:00

score 0 · Answer 4 · answered Mar 03 '23 at 19:09

An easy way to get the desired result within the tidyverse is to put the vector into a tibble, filter it, and then pull the vector back out.

tibble(myvec = gsub(" ", "-", tolower(state.name))) %>% 
   filter(myvec != "alaska") %>% pull(myvec)

This gives the desired output: [1] "alabama" "arizona" "arkansas" "california" "colorado" ...
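As a quick sanity check, the tibble round trip should give the same character vector as base subsetting. A sketch, assuming dplyr is installed; via_tibble and via_base are just illustrative names:

```r
library(dplyr)

states <- gsub(" ", "-", tolower(state.name))

# Round trip through a one-column tibble, as in the answer above.
via_tibble <- tibble(myvec = states) %>%
  filter(myvec != "alaska") %>%
  pull(myvec)

# The base R subsetting that the question wanted to avoid.
via_base <- states[states != "alaska"]

identical(via_tibble, via_base)  # TRUE
```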
