
I'm trying to tokenise a data frame containing strings. Some of the strings contain hyphens, and I'd like to split on those hyphens using unnest_tokens().

I've tried upgrading tidytext from 0.1.9 to 0.2.0, and I've tried a number of variations on the regex to capture the hyphen, starting from:

library(dplyr)
library(tidytext)

df <- data.frame(words = c("Solutions for the public sector | IT for business", "Transform the IT experience - IT Transformation - ITSM"))

df %>% 
  unnest_tokens(query, words, 
                token = "regex",
                pattern = "(?:\\||\\:|[-]|,)")

I expect to see:

query
solutions for the public sector
it for business
transform the it experience
it transformation
itsm

Instead, I only get the tokens from the line that has no hyphens:

query
solutions for the public sector
it for business
alexmathios

1 Answer


You may use

library(stringr)
df %>%  
  unnest_tokens(query, words, token = stringr::str_split, pattern = "[-:,|]")

This command uses stringr::str_split to split on the [-:,|] pattern, which matches any single one of -, :, a comma, or |. None of these need to be escaped inside a character class (bracket expression): the hyphen is literal when it is the first or last character in the class, and the others are simply not special there.
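To see what the [-:,|] character class does on its own, here is a minimal base R sketch that does not involve tidytext at all. It assumes the same two strings as in the question; the trimws() call is an extra step added here to drop the spaces around each delimiter, and tolower() is used only so the result matches the expected output shown in the question.

```r
# Same two strings as in the question, as a plain character vector.
words <- c("Solutions for the public sector | IT for business",
           "Transform the IT experience - IT Transformation - ITSM")

# Split on any one of -, :, comma, or |.
# The hyphen comes first in the class, so it is literal and needs no escape.
parts <- unlist(strsplit(words, "[-:,|]"))

# Extra steps (assumptions for this sketch): strip the whitespace left
# around each delimiter and lowercase the pieces.
query <- tolower(trimws(parts))
print(query)
```

If this returns the five pieces you expect, the pattern itself is fine, and any remaining difference lies in how the tokenising function is called rather than in the regex.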

Wiktor Stribiżew