
I'm trying to use qdap::check_spelling() on 7M very short sentences (e.g. 1–4 words each).

I'm running the script over SSH on a Linux server with 64GB of RAM, and after about 6 hours of running I get a "Killed" message in the terminal, which I believe means the process used too much memory and was killed by the OS.

My goal is to return a data frame to write to a csv with the following fields:

Unique list of misspelt words | The frequency of the misspelt word | an example of the misspelt word for context

Ordered by descending frequency, so I can find the most common misspelt words. Once I generate this, we have a support team who will work through the most frequent misspellings and correct as many as they can. They asked for some context for the misspelt words, i.e. seeing them within the larger sentence, so I'm attempting to pull the first instance of each misspelt word and add it to the third column.

Example:

library(tidyverse)
library(qdap)
# example data
exampledata <- data.frame(
  id = 1:5,
  text = c("cats dogs dgs cts oranges",
           "orngs orngs cats dgs",
           "bannanas, dogs",
           "cats cts dgs bnnanas",
           "ornges fruit")
)

# check for unique misspelt words using qdap
all.misspelts <- check_spelling(exampledata$text) %>% data.frame %>% select(row:not.found)
unique.misspelts <- unique(all.misspelts$not.found)

# for each misspelt word, get the first row it appears in, for context/example of the word in a sentence
contexts.misspelts.index <- lapply(unique.misspelts, function(x) {
  filter(all.misspelts, grepl(paste0("\\b", x, "\\b"), not.found))[1, "row"]
}) %>% unlist

# join it all together in a data frame to write to a csv
contexts.misspelts.vector <- exampledata[contexts.misspelts.index, "text"]
freq.misspelts <- table(all.misspelts$not.found) %>% data.frame() %>% mutate(Var1 = as.character(Var1))
misspelts.done <- data.frame(unique.misspelts, contexts.misspelts.vector, stringsAsFactors = F) %>%
  left_join(freq.misspelts, by = c("unique.misspelts" = "Var1")) %>% arrange(desc(Freq))
write.csv(x = misspelts.done, file="~/csvs/misspelts.example_data_done.csv", row.names=F, quote=F)

This is what it looks like:

> misspelts.done
  unique.misspelts contexts.misspelts.vector Freq
1              dgs cats dogs dgs cts oranges    3
2              cts cats dogs dgs cts oranges    2
3            orngs      orngs orngs cats dgs    2
4         bannanas            bannanas, dogs    1
5          bnnanas      cats cts dgs bnnanas    1
6           ornges              ornges fruit    1

This is exactly what I want! But I'm struggling to run it on my real dataset of 7M documents. The script runs for several hours and then the terminal prints "Killed".

I could break it up and loop over the data in chunks. But before I do that, is there a better way to achieve my goal?
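For reference, the chunked version I have in mind would look roughly like this sketch. `process_chunk` is a hypothetical stand-in for the `check_spelling()` pipeline above (it just counts rows here, for illustration), and the chunk size would need tuning to the server's memory:

```r
# Sketch of the chunking plan: split the big data frame into fixed-size
# chunks, summarise each chunk separately, then rbind the small summaries.
chunk_size <- 3  # would be e.g. 100000 on the real 7M-row data

# Hypothetical stand-in for the real per-chunk work (running
# check_spelling() and tabulating misspelt words)
process_chunk <- function(chunk) {
  data.frame(rows_processed = nrow(chunk))
}

exampledata <- data.frame(id = 1:10, text = paste("sentence", 1:10))

# assign each row a chunk id, split the rows, process each chunk
chunk_ids <- ceiling(seq_len(nrow(exampledata)) / chunk_size)
results   <- lapply(split(exampledata, chunk_ids), process_chunk)

# combine the (much smaller) per-chunk summaries
combined <- do.call(rbind, results)
```

Only the small per-chunk summaries are kept in memory at once, which is the point of chunking.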

Doug Fir
  • I'm the author of `qdap`...may I suggest Jeroen Ooms' excellent hunspell package instead. It didn't exist when qdap's spell-checking tools were created, but it does now and it's terrific. https://CRAN.R-project.org/package=hunspell – Tyler Rinker Jul 20 '17 at 02:16
  • Hi @TylerRinker thank you for the suggestion (and for qdap). Actually, I managed to get around this issue in this instance by just splitting up my data into chunks then looping over each chunk with `qdap::check_spelling()`. Next time I'll take a look at hunspell too – Doug Fir Jul 20 '17 at 03:39
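  • For readers following the hunspell suggestion above, a minimal sketch of the same frequency-plus-context idea (assuming hunspell's default en_US dictionary; `hunspell()` returns one character vector of misspelt words per input string, with no per-word grepl loop needed):

```r
library(hunspell)

texts <- c("cats dogs dgs cts oranges",
           "orngs orngs cats dgs")

# one character vector of misspelt words per input sentence
bad <- hunspell(texts)

# flatten to (word, sentence) pairs; each word keeps its source sentence
all_bad <- data.frame(
  not.found = unlist(bad),
  text      = rep(texts, lengths(bad)),
  stringsAsFactors = FALSE
)

# frequency per misspelt word plus the first sentence it appeared in
summary_df <- aggregate(text ~ not.found, data = all_bad,
                        FUN = function(x) x[1])
summary_df$Freq <- as.vector(table(all_bad$not.found)[summary_df$not.found])
summary_df <- summary_df[order(-summary_df$Freq), ]
```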

0 Answers