I would like to replace specific tokens in an object of class tokens from the quanteda package, but I cannot replicate a standard approach that works well with stringr.

The objective is to replace every token of the form "XXXof" with two tokens of the form c("XXX", "of").

Please have a look at the minimal example below:

suppressPackageStartupMessages(library(quanteda))
library(stringr)

text = "It was a beautiful day down to the coastof California."

# I would solve this with stringr as follows: 
text_stringr = str_replace( text, "(^.*?)(of)", "\\1 \\2" )
text_stringr
#> [1] "It was a beautiful day down to the coast of California."

# I fail to find a similar solution with quanteda that works on objects of class tokens
tok = tokens( text )

# I want to replace "coastof" with "coast" and "of"
tokens_replace( tok, "(^.*?)(of)", "\\1 \\2", valuetype = "regex" )
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "It"         "was"        "a"          "beautiful"  "day"       
#>  [6] "down"       "to"         "the"        "\\1 \\2"    "California"
#> [11] "."

Any workaround?

Created on 2021-03-16 by the reprex package (v1.0.0)

Francesco Grossetti

1 Answer

You can use a mixture of quanteda and stringr functions to build a list of the words that need separating and their separated forms, then use tokens_replace() to perform the replacement. This has the advantage of letting you curate the list before applying it, so you can verify that you haven't caught replacements you don't want to apply.

suppressPackageStartupMessages(library("quanteda"))

toks <- tokens("It was a beautiful day down to the coastof California.")

keys <- as.character(tokens_select(toks, "(^.*?)(of)", valuetype = "regex"))
# nested call instead of %>%, so no pipe operator needs to be attached
vals <- strsplit(stringr::str_replace(keys, "(^.*?)(of)", "\\1 \\2"), " ")

keys
## [1] "coastof"
vals
## [[1]]
## [1] "coast" "of"

tokens_replace(toks, keys, vals)
## Tokens consisting of 1 document.
## text1 :
##  [1] "It"         "was"        "a"          "beautiful"  "day"       
##  [6] "down"       "to"         "the"        "coast"      "of"        
## [11] "California" "."

Created on 2021-03-16 by the reprex package (v1.0.0)
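
The curation step mentioned above could look roughly like the sketch below. It is only an illustration under a few assumptions: the example sentence, the tighter pattern "^.+of$", and the keep_whole vector are all made up here and are not part of the original reprex. The point is that a pattern such as "(^.*?)(of)" matches any token containing "of", so ordinary words like "roof" or "proof" would be caught unless they are filtered out before calling tokens_replace().

```r
suppressPackageStartupMessages(library("quanteda"))

# Illustrative sentence (an assumption, not from the original question)
toks <- tokens("Proof of the roof damage down the coastof California.")

# Candidate keys: token types ending in "of" with at least one character
# before it. This still catches ordinary words such as "Proof" and "roof".
keys <- grep("^.+of$", types(toks), value = TRUE, ignore.case = TRUE)

# Hypothetical hand-curated list of words that must not be split
keep_whole <- c("proof", "roof")
keys <- keys[!tolower(keys) %in% keep_whole]

# Split each remaining key into its stem and "of" using base R
vals <- lapply(keys, function(k) c(sub("of$", "", k, ignore.case = TRUE), "of"))

# Apply the curated replacements; leave the tokens untouched if nothing is left
toks_fixed <- if (length(keys) > 0) tokens_replace(toks, keys, vals) else toks
toks_fixed
```

Inspecting keys before the final call is where the manual check happens; anything that should stay in one piece goes into keep_whole.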

Ken Benoit