I have a tidy data frame created from a text corpus. I want to create a new binary variable that flags whether each text contains any string from a vector of strings. My current for loop works, but it is far too slow with 600k observations, even though most observations are only about 5 words long.

Tidy df structure: 8 variables by 600k observations, with the 8th variable being the text to search. The 9th variable should be 1/0 depending on whether the text mentions a pharmaceutical with abuse potential.

abusepharma <- c('xanax', 'diazepam', 'alprazolam', 'adderall', 'oxycodone', 'viagra', 'oxycontin', 'valium', 'fentanyl', 'cialis', 'tramadol', 'amphetamine', 'hydromorphone', 'hydromorphon')
name.clean_tidy$AbusePharma <- NA

for(i in 1:nrow(name.clean_tidy)){
  if(grepl(paste(abusepharma,collapse="|"), name.clean_tidy[i,8])){
    name.clean_tidy[i,9] <- 1
  }else{
    name.clean_tidy[i,9] <- 0
  }

}

Garglesoap
  • Why are you using a for loop instead of an ifelse statement? Since you are using tidytext, you are also using dplyr, so mutate(AbusePharma = ifelse......) – phiver Jun 06 '18 at 17:56
  • First, I would build the pattern once up front, `pattern <- paste(abusepharma, collapse = "|")`, and then assign directly to the new variable, i.e. `name.clean_tidy$AbusePharma <- grepl(pattern, name.clean_tidy[[8]])`. You don't really need 0s and 1s, since they are equivalent to `FALSE` and `TRUE` respectively. If you really want 0s and 1s, then just do `name.clean_tidy$AbusePharma <- name.clean_tidy$AbusePharma * 1` – Yannis Vassiliadis Jun 06 '18 at 18:05
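
The vectorised approach suggested in the comments can be sketched roughly as follows. This is a minimal illustration, not a tested solution for the real data: the toy data frame and its column name `text` are assumptions standing in for the 8th column of `name.clean_tidy`, and the shortened drug list is only for brevity.

```r
# Shortened version of the search vector from the question.
abusepharma <- c("xanax", "diazepam", "oxycodone", "valium", "fentanyl")

# Toy stand-in for name.clean_tidy; the column name `text` is hypothetical.
name.clean_tidy <- data.frame(
  id = 1:4,
  text = c("buy xanax online", "garden hose",
           "generic valium 10mg", "blue widgets"),
  stringsAsFactors = FALSE
)

# Build the alternation pattern once instead of on every loop iteration.
pattern <- paste(abusepharma, collapse = "|")

# grepl() is vectorised over the text, so a single call covers every row.
# as.integer() turns TRUE/FALSE into 1/0.
name.clean_tidy$AbusePharma <- as.integer(grepl(pattern, name.clean_tidy$text))
```

Note that this matches substrings, so a pattern such as 'amphetamine' would also flag a text containing "methamphetamine". If the texts have not already been lowercased, passing `ignore.case = TRUE` to `grepl()` may be worth considering.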
