
I am trying to filter out junk open answers (string variables) like 'ffff' and 'fdaljfdlksajf' using an R script. I hoped there would be some kind of dictionary package available in R with which I could do this, but I can't seem to find one.

Another option would be to load a list of Dutch words (that's the dictionary I need) and compare it to the input, but such a list is not easy to find.

Has anyone experimented with this before and found a solution?

SHW
  • Have you tried the tm or qdap packages in R? They offer some text cleaning, but I think it is English-only. Good luck. (It's unlikely that someone has done the same for Dutch; Italian is in the same situation: no luck there.) – Ale Dec 02 '16 at 15:53
  • Thank you for the suggestion, Ale. I will take a look at it soon and get back to you. – SHW Dec 05 '16 at 18:24

1 Answer


Try the SnowballC package. It implements word-stemming algorithms, but it supports several languages, including Dutch, and ships a vocabulary for each language.

library(SnowballC)
load(system.file("words", "dutch.RData", package = "SnowballC"))
voc[[1]] # Dutch words
voc[[2]] # Stemmed Dutch words

Now that you have the vocabularies, you could compute, for each open response, the percentage of words that match the Dutch vocabulary, and set a threshold below which an answer is filtered out as "bad".
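The thresholding idea could be sketched roughly as follows. This is only a minimal illustration: `dutch_words` is a tiny hypothetical stand-in for the real vocabulary (in practice you would use the word list loaded above), and `share_known`, `answers` and the 0.5 cut-off are names and values assumed for the example, not part of any package.

```r
# Hypothetical stand-in for the Dutch word list; in practice this would be
# the vocabulary loaded from the package or any other Dutch word list.
dutch_words <- c("ik", "vind", "het", "een", "goed", "idee",
                 "de", "service", "was", "slecht")

# Fraction of the words in one answer that appear in the vocabulary.
share_known <- function(answer, vocabulary) {
  # Lower-case the answer and split on anything that is not a letter.
  tokens <- unlist(strsplit(tolower(answer), "[^[:alpha:]]+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(0)  # empty answers count as "bad"
  mean(tokens %in% vocabulary)
}

# Made-up example answers, including the junk strings from the question.
answers <- c("Ik vind het een goed idee", "ffff",
             "fdaljfdlksajf", "De service was slecht")

scores <- vapply(answers, share_known, numeric(1),
                 vocabulary = dutch_words, USE.NAMES = FALSE)

threshold <- 0.5            # assumed cut-off; tune it on your own data
keep <- scores >= threshold
answers[keep]               # answers that pass the filter
```

A real word list will miss names, typos and inflected forms, so it is probably worth inspecting the answers close to the threshold by hand before discarding them.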

hoggue
  • Ola Hoggue. Thank you so much for the suggestion. I didn't have time to look at it right away, but I will certainly do so in the next couple of days and give you feedback on the solution. It sounds promising. – SHW Dec 05 '16 at 18:23