I am trying, so far without success, to apply autocorrection and stemming with Hunspell to my data. The data is a tibble of sentences, each with an author, which are then to be evaluated via a more complex unnest function. The answer to "How do i optimize the performance of stemming and spell check in R?" already describes a very effective way to do autocorrection and stemming.

As I see it, there are two options: either apply autocorrection and stemming to the complete sentences and then run my evaluations via unnest, or check and adjust the individual words inside the pipe.

The following data describes my problem:

library(dplyr)
library(tidytext)

df <- tibble::tribble(
  ~text, ~author,
  "We aree drivng as fast as we drove yestrday or evven fastter zysxzw", "U1",
  "Today waas a beautifull day", "U2",
  "Hopefulli we learn to write correect one day", "U2"
)

# Tokenize into one word per row, then count words per author
df %>%
  unnest_tokens(input = text,
                output = word) %>%
  count(author, word, sort = TRUE)

However, so far I have not found a way to perform the autocorrection and stemming before this example evaluation. For example, I would like the count function to see only words that have been checked and corrected.

I've been stuck on this problem for a while and am infinitely grateful for any input and ideas! Thank you!

Alex_

1 Answer

Here is a simple guide to get you started:

You will need the function spellAndStem_tokens from "How do i optimize the performance of stemming and spell check in R?":

spellAndStem_tokens <- function(sent, language = "en_US") {
    
    sent_t <- quanteda::tokens(sent)
    
    # extract types to only work on them
    types <- quanteda::types(sent_t)
    
    # spelling
    correct <- hunspell_check(
        words = as.character(types), 
        dict = hunspell::dictionary(language)
    )
    
    pattern <- types[!correct]
    replacement <- sapply(hunspell_suggest(pattern, dict = language), FUN = "[", 1)
    
    types <- stringi::stri_replace_all_fixed(
        types,
        pattern, 
        replacement,
        vectorize_all = FALSE
    )
    
    # stemming
    types <- hunspell_stem(types, dict = dictionary(language))
    
    
    # replace original tokens
    sent_t_new <- quanteda::tokens_replace(sent_t, quanteda::types(sent_t), as.character(types))
    
    sent_t_new <- quanteda::tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")
    
    paste(as.character(sent_t_new), collapse = " ")
}

Then the code for the first output:

#install.packages("quanteda")
#install.packages("hunspell")
library(hunspell)
library(quanteda)
library(tidyverse)
library(tidytext)

# Baseline: word counts on the raw, uncorrected text
df %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE) %>%
    print(n = 30)

   word           n
   <chr>      <int>
 1 we             3
 2 as             2
 3 day            2
 4 a              1
 5 aree           1
 6 beautifull     1
 7 correect       1
 8 drivng         1
 9 drove          1
10 evven          1
11 fast           1
12 fastter        1
13 hopefulli      1
14 learn          1
15 one            1
16 or             1
17 to             1
18 today          1
19 waas           1
20 write          1
21 yestrday       1
22 zysxzw         1

Then the code for the second output:

# Same counts, then spell check and stemming applied to the word column
df %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE) %>%
    mutate(word = spellAndStem_tokens(word)) %>%
    print(n = 30)

output:
  word                                                                                                              n
   <chr>                                                                                                         <int>
 1 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     3
 2 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     2
 3 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     2
 4 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
 5 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
 6 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
 7 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
 8 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
 9 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
10 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
11 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
12 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
13 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
14 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
15 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
16 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
17 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
18 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
19 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
20 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
21 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
22 we a day a are beautiful correct drive drove even fast fast hopeful learn one or to today was write yesterday     1
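
A note on why every row above holds the same long string: spellAndStem_tokens() ends with paste(..., collapse = " "), so it returns a single string for whatever vector it receives, and mutate() recycles that length-1 value to all rows. Below is a minimal sketch of the effect. The function collapse_all and the three-row tibble are made up for illustration and only stand in for the last line of spellAndStem_tokens().

```r
library(dplyr)

# Made-up example data: three words with their counts
words <- tibble::tibble(word = c("we", "as", "day"), n = c(3L, 2L, 2L))

# Hypothetical stand-in for the final paste() in spellAndStem_tokens():
# it collapses a whole character vector into ONE string
collapse_all <- function(x) paste(x, collapse = " ")

collapsed <- collapse_all(words$word)
length(collapsed)
# 1

# mutate() recycles the length-1 result, so every row gets the same string
res <- words %>% mutate(word = collapse_all(word))
res$word
# "we as day" "we as day" "we as day"
```

So the function is suited to one sentence at a time, not to a column of single words.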
TarJae
  • Hi @TarJae! Thank you very much in advance. But I don't quite understand why complete sentences are returned? It must be only single words...? Also, I have swapped the spellAndStem function with the count function, only the corrected words should be processed further. – Alex_ Sep 05 '21 at 10:20
  • Could you please help me some more here? Thank you! @tarjae ! – Alex_ Oct 03 '21 at 12:08
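
On the follow-up in the comments (single words instead of sentences, and correction before counting), one possible approach is a helper that returns exactly one value per input word, so it can be used inside mutate() before count(). This is only a sketch under assumptions: correct_and_stem is a hypothetical name and is not from the linked answer, taking the first suggestion and the last stem is an arbitrary choice, and the results depend on the installed en_US dictionary.

```r
library(dplyr)
library(tidytext)
library(hunspell)

df <- tibble::tribble(
  ~text, ~author,
  "We aree drivng as fast as we drove yestrday or evven fastter zysxzw", "U1",
  "Today waas a beautifull day", "U2",
  "Hopefulli we learn to write correect one day", "U2"
)

# Hypothetical helper: one corrected, stemmed value per input word;
# NA where hunspell offers no suggestion or no stem
correct_and_stem <- function(words, language = "en_US") {
  dict <- hunspell::dictionary(language)
  uniq <- unique(words[!is.na(words)])  # look up each distinct word once

  # Spell check; replace misspelled words with the first suggestion, if any
  ok <- hunspell::hunspell_check(uniq, dict = dict)
  fixed <- uniq
  if (any(!ok)) {
    suggestions <- hunspell::hunspell_suggest(uniq[!ok], dict = dict)
    fixed[!ok] <- vapply(
      suggestions,
      function(s) if (length(s) > 0) s[[1]] else NA_character_,
      character(1)
    )
  }

  # Stem; keep the last stem hunspell returns (an arbitrary choice)
  stemmed <- rep(NA_character_, length(fixed))
  has_word <- !is.na(fixed)
  if (any(has_word)) {
    stems <- hunspell::hunspell_stem(fixed[has_word], dict = dict)
    stemmed[has_word] <- vapply(
      stems,
      function(s) if (length(s) > 0) s[[length(s)]] else NA_character_,
      character(1)
    )
  }

  # Map the results back so the output has the same length as the input
  stemmed[match(words, uniq)]
}

# Correct and stem first, drop words without a result, then count per author
counts <- df %>%
  unnest_tokens(output = word, input = text) %>%
  mutate(word = correct_and_stem(word)) %>%
  filter(!is.na(word)) %>%
  count(author, word, sort = TRUE)
```

Because the helper keeps the vector length, each row still holds a single word, and only words that hunspell could check or correct reach count().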