Tokenizing words using tidytext - preserving punctuation

Question

I've been trying to preserve punctuation such as "-", "(", "/" and "'" when tokenizing words.

library(dplyr)     # provides tibble() and %>%
library(tidytext)  # provides unnest_tokens()

data = tibble(title = "Computer-aided detection (1 / 2)")
data %>% unnest_tokens(input = title, 
                    output = słowo, 
                    token = "ngrams", 
                    n = 2)

I want the output to look like this:

computer-aided
aided detection
detection (1
(1 / 2)

Any suggestions?

score 1 · Answer 1 · answered Apr 17 '20 at 10:16

If you want to preserve "(", "/" and ")", the bigrams would be "(1 /" and "/ 2)", not "(1 / 2)"; that last one is a 3-gram. Also, if you want to keep the hyphen (-), the second line of your desired output ("aided detection") would not exist, because the text would not be split on the hyphen.
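
As a minimal base R sketch of that point (this is not tidytext's own tokenizer, and the variable names are only illustrative), splitting on whitespace keeps the punctuation attached to each word, and pairing neighbouring tokens gives the bigrams:

```r
title <- "Computer-aided detection (1 / 2)"

# Split on whitespace only, so "-", "(", "/" and ")" stay inside the tokens
words <- strsplit(tolower(title), "\\s+")[[1]]

# Pair each token with the one that follows it to form bigrams
bigrams <- paste(head(words, -1), tail(words, -1))

print(bigrams)
```

This gives "computer-aided detection", "detection (1", "(1 /" and "/ 2)", so "(1 / 2)" can only appear once you also build 3-grams.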

tidytext uses the tokenizers package to unnest the data. The ngram tokenizer cannot handle these exceptions.

Here is an example using quanteda with the option fasterword that covers most of your needs.

library(quanteda)
tokens(data$title, what =  "fasterword", remove_punct = FALSE) %>% 
  tokens_ngrams(n = 2, concatenator = " ")

Tokens consisting of 1 document.
text1 :
[1] "Computer-aided detection" "detection (1"             "(1 /"                     "/ 2)"

You could experiment with different values of n, such as n = 2:3, to see where that gets you, and filter out what you don't need.
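
A rough sketch of that idea, reusing the same quanteda calls. The filter rule here (dropping n-grams that begin or end with a lone "/") is only an assumption about what you don't need, so adjust it to your data:

```r
library(quanteda)

title <- "Computer-aided detection (1 / 2)"

# Split on whitespace only, keeping punctuation attached to the tokens
toks <- tokens(title, what = "fasterword", remove_punct = FALSE)

# Build bigrams and trigrams in one pass and flatten to a character vector
grams <- as.character(tokens_ngrams(toks, n = 2:3, concatenator = " "))

# Hypothetical filter: drop n-grams that begin or end with a bare "/"
kept <- grams[!grepl("^/ | /$", grams)]

print(kept)
```

With this rule, "(1 /" and "/ 2)" are dropped while the 3-gram "(1 / 2)" is kept.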
