I am working with EMR data. Many entities in medical records are split across two words (for example, "CT Scan"), and I plan to join these tokens into a single token with an underscore ("CT_Scan"). Is there a faster way to perform this task on a huge corpus? My approach uses the "quanteda" package. Here is the code snippet:
# Sample text
mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
"New York City has raised taxes: an income tax and inheritance taxes.")
# Tokenize and remove punctuation
library(quanteda)
mytoks <- tokens(mytexts, remove_punct = TRUE)
# List of token sequences that need to be joined
myseqs <- list(c("tax"), c("income", "tax"), c("capital", "gains", "tax"), c("inheritance", "tax"))
# Tokens object with the matched sequences compounded into single tokens
clean_toks <- tokens_compound(mytoks, myseqs)
This task was performed on about 3 billion tokens, and the tokens_compound() function took a long time (>12 hrs). Is there a better way to solve this problem?
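
For reference, here is a rough, untested sketch of the direction I have been considering: declaring the multi-word sequences with phrase(), switching to fixed (non-glob) matching, and compounding the documents in chunks in parallel. The chunk size, the core count, and the use of parallel::mclapply() are assumptions on my part rather than something I have benchmarked, and mclapply() with more than one core only works on Unix-alikes.

library(quanteda)
library(parallel)

# Multi-word sequences as phrase() patterns; valuetype = "fixed" skips
# glob/regex matching (assumption: this is cheaper on large inputs)
myphrases <- phrase(c("income tax", "capital gains tax", "inheritance tax"))

# Split document indices into chunks (1000 documents per chunk is a guess)
idx_chunks <- split(seq_along(mytoks), ceiling(seq_along(mytoks) / 1000))

# Compound each chunk in parallel (mc.cores > 1 requires a Unix-alike OS)
compounded <- mclapply(idx_chunks, function(idx) {
  tokens_compound(mytoks[idx], pattern = myphrases, valuetype = "fixed")
}, mc.cores = 4)

# Recombine the chunks; c() should combine tokens objects back into one
clean_toks <- do.call(c, compounded)

I have not verified whether the chunking and recombination overhead outweighs the gains, which is part of what I am asking.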