I have a dataset of user answers to the question of whether they know a brand or not. Some users just answered nonsense, as you can see in my example.

meinstring <- c("----asdada", "no idea", "C&A", "aaaaaaaaaa", "---", "adaosdjasodajsdoad")


library(stringr)

spamidenfifier <- function(x) {
  zeichen <- strsplit(as.character(x), "")[[1]]  # single characters of the string
  # share of vowels among all characters
  verhaeltnis <- str_count(tolower(x), "[aeoiu]") / str_count(x)
  # number of characters that sit at position 3 or later in a run of identical characters
  sequenz <- sum(sequence(rle(zeichen)$lengths) >= 3, na.rm = TRUE)
  # weight, because a high share of distinct characters is less likely in longer strings
  weight <- if (str_count(x) > 4) 0.9 else 1
  # share of distinct characters, scaled by the weight
  variation_buchstaben <- length(unique(zeichen)) / str_count(x) * weight
  verhaeltnis < 0.2 || verhaeltnis > 0.8 || sequenz > 0 || variation_buchstaben < 0.5
}


sapply(meinstring, spamidenfifier)

Output:

----asdada            no idea                C&A         aaaaaaaaaa                --- adaosdjasodajsdoad 
      TRUE              FALSE              FALSE               TRUE               TRUE               TRUE 

My function works reasonably well, but there may be better solutions. Is there a package or a better method to tell whether a word was merely misspelled or the person answered nonsense? If not, suggestions to improve the function are highly appreciated!

Edit: I updated the function with some improvements :-)

  • I think a good first-order solution is to see whether the words can be recognized as real words. You could use a spellchecker such as [hunspell](https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html) and see whether that package recognizes the words. If it does not, the word is probably bogus. – Paul Hiemstra Nov 19 '18 at 16:51
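
A minimal sketch of the spellchecker idea from the comment above, assuming the `hunspell` package and its bundled `en_US` dictionary are installed. The helper name `looks_like_words` and the example vector are hypothetical. Brand names such as "C&A" are unlikely to be in a general dictionary, so this check would probably need an extra list of accepted brands.

```r
library(hunspell)

# Hypothetical helper: TRUE when an answer contains at least one alphabetic
# token and every token is accepted by the dictionary.
looks_like_words <- function(x, dict = dictionary("en_US")) {
  vapply(x, function(answer) {
    # split on anything that is not a letter and drop empty pieces
    tokens <- strsplit(answer, "[^[:alpha:]]+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    length(tokens) > 0 && all(hunspell_check(tokens, dict = dict))
  }, logical(1), USE.NAMES = FALSE)
}

looks_like_words(c("no idea", "adaosdjasodajsdoad", "---"))
```

Answers that fail this check are only candidates for nonsense; a misspelled real word fails it as well.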

1 Answer

Just my spontaneous idea:

meinstring <- c("----asdada", "no idea", "C&A", "aaaaaaaaaa", "---", "adaosdjasodajsdoad", "+-*-", "*-+-", "adfpdflrraaeea")

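# Flag strings that consist only of non-word characters, contain a run of
# punctuation or spaces, or contain three or more vowels in a row.
# Exceptions go in the second grepl().
grepl('^\\W+$|(?:[-!@#$%^&*\\[\\]()";:_<>.,=+/ ]){2,}|[-!@#$%^&*\\[\\]()";:_<>.,=+/ ]{3,}|[aeoiu]{3,}',
      meinstring, perl = TRUE) & !grepl("iou|zweieiig", meinstring)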

[1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

There is no neat, perfect solution.
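
To make the pattern easier to tune, the same rules can be sketched as separate, named checks in base R. The helper name `flag_rules` and its column names are hypothetical. The sketch leaves out the `{3,}` punctuation alternative, because the `{2,}` alternative already matches every string it would match.

```r
# Hypothetical helper: one column per rule, so it is visible which rule fired.
flag_rules <- function(x, exceptions = "iou|zweieiig") {
  punct <- '[-!@#$%^&*\\[\\]()";:_<>.,=+/ ]'
  rules <- cbind(
    only_nonword = grepl("^\\W+$", x, perl = TRUE),               # e.g. "---"
    punct_run    = grepl(paste0(punct, "{2,}"), x, perl = TRUE),  # e.g. "----asdada"
    vowel_run    = grepl("[aeoiu]{3,}", x, perl = TRUE)           # e.g. "aaaa"
  )
  is_exception <- grepl(exceptions, x, perl = TRUE)
  # nonsense: at least one rule fired and no exception applies
  cbind(rules, nonsense = rowSums(rules) > 0 & !is_exception)
}

meinstring <- c("----asdada", "no idea", "C&A", "aaaaaaaaaa", "---",
                "adaosdjasodajsdoad", "+-*-", "*-+-", "adfpdflrraaeea")
flag_rules(meinstring)[, "nonsense"]
```

The `nonsense` column reproduces the result shown above, and the other columns show which rule is responsible for each flag.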

Andre Elrico
  • Interesting try, but your solution will flag words like "delicious" (iou) as nonsense. – iod Nov 19 '18 at 16:28
  • hehe :D, sure, in German there are also `zweieiig, Donauauen, Donau-Auen, Treueeid`; some **micro** is definitely needed. – Andre Elrico Nov 19 '18 at 16:30
  • Yeah, I think a real solution will have to use something like the `lexicon` package and search through it. Which could be overkill, depending on how big the actual data is. – iod Nov 19 '18 at 16:35
  • Wow, I understand nothing ;-). But it works quite well. I guess it might be better than my solution. –  Nov 19 '18 at 18:13
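
The dictionary idea from the comments can also help with the original question of separating misspellings from nonsense. Below is a rough sketch using base R's `adist()` (generalized Levenshtein distance) against a list of expected answers. The vector `known_brands`, the helper `classify_answer`, and the threshold `max_rel_dist = 0.34` are assumptions for illustration, not tested values.

```r
# Hypothetical reference list; in practice this would hold the brands asked about.
known_brands <- c("C&A", "H&M", "Zara", "Adidas")

# Hypothetical helper: returns "exact", "misspelled" or "unknown" per answer.
classify_answer <- function(x, known = known_brands, max_rel_dist = 0.34) {
  d <- adist(tolower(x), tolower(known))  # edit distances: answers in rows, brands in columns
  best <- apply(d, 1, min)                # distance to the closest brand
  rel <- best / pmax(nchar(x), 1)         # scale by answer length
  ifelse(best == 0, "exact",
         ifelse(rel <= max_rel_dist, "misspelled", "unknown"))
}

classify_answer(c("C&A", "Addidas", "zarra", "adaosdjasodajsdoad"))
# "exact" "misspelled" "misspelled" "unknown"
```

Answers classified as "unknown" could then be passed to one of the pattern-based checks above, which keeps the heuristics away from answers that are merely misspelled.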