When you set up anti_join(), you need to say what the column names are on the left- and right-hand sides. In the stop_words data object in tidytext, the column is called word, and in your dataframe it is called token.
library(tidyverse)
library(tidytext)
docID <- c(1, 2, 2, 2, 3)
token <- c("the", "cat", "sat", "on-the-mat", "with3hats")
count <- c(10, 20, 30, 10, 20)
# tibble() replaces the deprecated data_frame()
df <- tibble(docID, token, count)
clean_df <- df %>%
  anti_join(stop_words, by = c("token" = "word"))
clean_df
#> # A tibble: 4 x 3
#>   docID token      count
#>   <dbl> <chr>      <dbl>
#> 1  2.00 cat         20.0
#> 2  2.00 sat         30.0
#> 3  2.00 on-the-mat  10.0
#> 4  3.00 with3hats   20.0
Notice that "the" is now gone because it is in the stop_words dataset.
In a comment, you asked about removing tokens that contain punctuation or numbers. I'd use filter() for this (you can actually use filter() to remove stopwords too, if you prefer).
clean_df <- df %>%
  filter(!str_detect(token, "[:punct:]|[:digit:]"))
clean_df
#> # A tibble: 3 x 3
#>   docID token count
#>   <dbl> <chr> <dbl>
#> 1  1.00 the    10.0
#> 2  2.00 cat    20.0
#> 3  2.00 sat    30.0
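As for the filter() route to removing stopwords mentioned above, here is a minimal sketch of one way it could look. It assumes you match against the word column of stop_words with %in%; the name no_stop_df is just an illustrative choice, not something from your code.

```r
library(dplyr)
library(tidytext)

# Same toy data as above, rebuilt so this snippet runs on its own
df <- tibble(
  docID = c(1, 2, 2, 2, 3),
  token = c("the", "cat", "sat", "on-the-mat", "with3hats"),
  count = c(10, 20, 30, 10, 20)
)

# Keep only the rows whose token is NOT one of the stop words
no_stop_df <- df %>%
  filter(!token %in% stop_words$word)
no_stop_df
```

This should drop the same "the" row that anti_join() dropped. Which one you use is mostly a matter of taste; anti_join() reads a bit more naturally when the stop word list is already a data frame.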
If you want to do both, build up your object with both lines using pipes.
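Something like the following is what I mean by chaining the two steps. It is only a sketch that reuses the same toy data and the same two calls from above, piped one after the other.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Same toy data as above, rebuilt so this snippet runs on its own
df <- tibble(
  docID = c(1, 2, 2, 2, 3),
  token = c("the", "cat", "sat", "on-the-mat", "with3hats"),
  count = c(10, 20, 30, 10, 20)
)

clean_df <- df %>%
  # Step 1: drop stop words, matching token (left) to word (right)
  anti_join(stop_words, by = c("token" = "word")) %>%
  # Step 2: drop tokens containing any punctuation or digit
  filter(!str_detect(token, "[:punct:]|[:digit:]"))
clean_df
```

With this data, only the "cat" and "sat" rows should be left. The order of the two steps doesn't matter for the result here, since each one just removes rows.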