How to tokenise on hyphens using unnest_tokens in R

Question

I'm trying to tokenise a dataframe containing strings. Some contain hyphens, and I'd like to tokenise on hyphens using unnest_tokens()

I've tried upgrading tidytext from 0.1.9 to 0.2.0, and I've tried a number of regex variations to split on the hyphen in data like this:



library(dplyr)
library(tidytext)

df <- data.frame(words = c("Solutions for the public sector | IT for business",
                           "Transform the IT experience - IT Transformation - ITSM"))

df %>%
  unnest_tokens(query, words,
                token = "regex",
                pattern = "(?:\\||\\:|[-]|,)")

I expect to see:

query
solutions for the public sector
it for business
transform the it experience
it transformation
itsm

Instead, I get tokens only from the line that has no hyphens:

query
solutions for the public sector
it for business

Try `df %>% unnest_tokens(query, words, token = stringr::str_split, pattern = "[-:,|]")` — Wiktor Stribiżew, Jun 13 '19 at 17:45

score 1 · Answer 1 · answered Jun 14 '19 at 09:23

You may use

library(dplyr)
library(tidytext)
library(stringr)

df %>%
  unnest_tokens(query, words, token = stringr::str_split, pattern = "[-:,|]")

This command uses stringr::str_split to split on the [-:,|] pattern, a character class that matches any one of the characters -, :, , or |. None of them needs to be escaped inside a character class (bracket expression): the hyphen is literal when it is the first or last character, and the others are simply not special there.
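
To see what the character class does on its own, here is a minimal sketch that calls str_split directly on the two sample strings from the question, outside of unnest_tokens. The variable names x, parts and parts_trimmed are only illustrative, and the str_trim step is my addition: the separators in the sample data are surrounded by spaces, so the raw pieces keep that whitespace.

```r
library(stringr)

# Sample strings from the question
x <- c("Solutions for the public sector | IT for business",
       "Transform the IT experience - IT Transformation - ITSM")

# Split on any one of -, :, , or |. str_split returns a list with one
# character vector per input string.
parts <- str_split(x, "[-:,|]")

# The separators are surrounded by spaces, so trim each piece
parts_trimmed <- lapply(parts, str_trim)

print(parts_trimmed)
```

If the tokens coming out of unnest_tokens still carry surrounding spaces, adding a mutate(query = str_trim(query)) step after it should remove them.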
