My data is already in a data frame, with one token per line. I'd like to filter out the rows that contain stop words.

The dataframe looks like:

docID <- c(1,2,2)
token <- c('the', 'cat', 'sat')
count <- c(10,20,30)
df <- data.frame(docID, token, count)

I've tried the below, but get an error:

library(tidyverse)
library(tidytext)
library(topicmodels)
library(stringr)
data('stop_words')
clean_df <- df %>%
  anti_join(stop_words, by=df$token)

Error:

Error: `by` can't contain join column `the`, `cat`, `sat` which is missing from LHS

How can I resolve this?

Adam_G

1 Answer

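When you call anti_join(), the by argument needs the names of the columns to match on, for both the left- and right-hand sides. Passing df$token hands it the column's values ("the", "cat", "sat") instead of a column name, which is what the error is complaining about. In the stop_words data object in tidytext the column is called word, and in your data frame it is called token.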

library(tidyverse)
library(tidytext)

docID <- c(1, 2, 2, 2, 3)
token <- c("the", "cat", "sat", "on-the-mat", "with3hats")
count <- c(10, 20, 30, 10, 20)
# tibble() replaces the deprecated data_frame()
df <- tibble(docID, token, count)

clean_df <- df %>%
  anti_join(stop_words, by = c("token" = "word"))

clean_df
#> # A tibble: 4 x 3
#>   docID token      count
#>   <dbl> <chr>      <dbl>
#> 1  2.00 cat         20.0
#> 2  2.00 sat         30.0
#> 3  2.00 on-the-mat  10.0
#> 4  3.00 with3hats   20.0

Notice that "the" is now gone because it is in the stop_words dataset.
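
If you would rather not map column names inside by, an equivalent approach is to rename the column first so both tables share a name. This is a minimal sketch, assuming the same small example data; renamed_df is just an illustrative name.

```r
library(dplyr)
library(tidytext)  # provides the stop_words data set

df <- tibble(
  docID = c(1, 2, 2),
  token = c("the", "cat", "sat"),
  count = c(10, 20, 30)
)

# Rename token to word so the join column has the same name in both tables
renamed_df <- df %>%
  rename(word = token) %>%
  anti_join(stop_words, by = "word")

renamed_df
```

The trade-off is that the column stays named word afterwards, so rename it back if later steps expect token.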

In a comment, you asked about removing tokens that contain punctuation or numbers. I'd use filter() for this. (You can actually use filter() to remove stop words too, if you prefer.)

clean_df <- df %>%
  filter(!str_detect(token, "[:punct:]|[:digit:]"))

clean_df
#> # A tibble: 3 x 3
#>   docID token count
#>   <dbl> <chr> <dbl>
#> 1  1.00 the    10.0
#> 2  2.00 cat    20.0
#> 3  2.00 sat    30.0
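
As for removing stop words with filter(), one way to sketch it is with %in% against the word column of stop_words. This assumes the same five-row example data as above; no_stop_df is an illustrative name.

```r
library(dplyr)
library(tidytext)  # provides the stop_words data set

df <- tibble(
  docID = c(1, 2, 2, 2, 3),
  token = c("the", "cat", "sat", "on-the-mat", "with3hats"),
  count = c(10, 20, 30, 10, 20)
)

# Keep only the rows whose token does not appear in the stop word list
no_stop_df <- df %>%
  filter(!token %in% stop_words$word)

no_stop_df
```

For this data it should keep the same rows as the anti_join() version.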

If you want to do both, build up your object with both lines using pipes.
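
Chaining the two steps might look like the sketch below, which reuses the same example data and the same pattern as the filter() call above.

```r
library(dplyr)
library(stringr)
library(tidytext)  # provides the stop_words data set

df <- tibble(
  docID = c(1, 2, 2, 2, 3),
  token = c("the", "cat", "sat", "on-the-mat", "with3hats"),
  count = c(10, 20, 30, 10, 20)
)

clean_df <- df %>%
  # Step 1: drop rows whose token is a stop word
  anti_join(stop_words, by = c("token" = "word")) %>%
  # Step 2: drop rows whose token contains punctuation or a digit
  filter(!str_detect(token, "[:punct:]|[:digit:]"))

clean_df
```

With this data only the rows for "cat" and "sat" should remain. The order of the two steps does not change the result here, since each one only removes rows.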

Julia Silge