
I am working with EMR data. Many entities within medical records are split across two or more words (e.g., "CT Scan"), but I plan on joining these tokens into a single word with an underscore ("CT_Scan"). Is there a faster way to perform this task on a huge corpus? My current approach uses the "quanteda" package. Here is the code snippet:

    # Sample text
    mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
                 "New York City has raised taxes: an income tax and inheritance taxes.")

    # Tokenize on whitespace, dropping punctuation
    library(quanteda)
    mytoks <- tokens(mytexts, remove_punct = TRUE)

    # List of token sequences that need to be joined
    myseqs <- list(c("tax"), c("income", "tax"), c("capital", "gains", "tax"),
                   c("inheritance", "tax"))

    # New tokens object with those sequences compounded into single tokens
    clean_toks <- tokens_compound(mytoks, myseqs)

This task was performed on about 3 billion tokens, and the `tokens_compound()` function took a long time (>12 hrs). Is there a better way to solve this problem?
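
For reference, one alternative I am considering (untested at this scale) is collapsing the known phrases in the raw text before tokenizing, so that `tokens_compound()` is not needed at all. A minimal sketch using `stringi`, assuming the full phrase list is known up front and ordered longest-first so shorter phrases cannot pre-empt longer ones:

    # Sketch only, not benchmarked: join known multi-word entities in the
    # raw text, then tokenize as usual.
    library(stringi)

    # Phrases ordered longest-first ("capital gains tax" before "income tax")
    phrases <- c("capital gains tax", "inheritance tax", "income tax")
    joined  <- gsub(" ", "_", phrases, fixed = TRUE)

    # vectorize_all = FALSE applies every pattern/replacement pair in turn
    # to each element of mytexts
    pretexts <- stri_replace_all_fixed(mytexts, phrases, joined,
                                       vectorize_all = FALSE)

    mytoks <- tokens(pretexts, remove_punct = TRUE)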

  • It appears that the package has options for using more threads for various tasks. Have you explored that at all? – hrbrmstr Dec 31 '17 at 01:05
  • `quanteda` uses n-1 threads on machines that have multiple cores / threads. The number of threads in use can be confirmed with `quanteda_options()`. – Len Greski Dec 31 '17 at 01:25
  • 2
    Also, what are you trying to do with the compound tokens? Depending on how you plan to subsequently process the data, there may be more efficient ways to process the data than `tokens_compound()`. That said, regarding your comment about performance, `quanteda` is the best performing NLP package for R. – Len Greski Dec 31 '17 at 01:57
  • 2
    @x1carbon I suggest you file a GitHub issue and provide some details of the size of your dataset in terms of number of documents, average length of document (in tokens), unique token types, and a percentage of the phrases that need to be compounded. We could then create some comparable data and profile it. You should also provide full details on your machine specs. – Ken Benoit Dec 31 '17 at 09:14
  • @LenGreski Thank you for your inputs. I am trying `quanteda_options("threads")`; let's see how it goes (see the sketch after this comment thread). – x1carbon Jan 02 '18 at 20:23
  • @KenBenoit Is there a different way to solve this problem than the quanteda approach? – x1carbon Jan 02 '18 at 20:25
  • @x1carbon - Ken is the author of `quanteda`, so you've got the best possible person trying to answer your question. Ken's comment to file a GitHub issue means to go to the quanteda GitHub site's [issues page](https://github.com/kbenoit/quanteda/issues), file an issue, and include more detail about your data set than you've posted to Stack Overflow. – Len Greski Jan 02 '18 at 20:28
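
Following up on the thread suggestions in the comments above, a minimal sketch of confirming and raising the thread count (the value 7 is illustrative, e.g., leaving one core free on an 8-core machine):

    library(quanteda)
    quanteda_options("threads")    # confirm the current thread count
    quanteda_options(threads = 7)  # raise the number of threads in use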

0 Answers