Datacheck : Compare string values (input) to existing language (Dutch dictionary) in R

Question

I am trying to filter out nonsense open-ended answers (string variables) such as 'ffff' and 'fdaljfdlksajf' using an R script. I hoped there was a dictionary package in R that could do this, but I can't find one.

Another option would be to load a list of Dutch words (that's the language I need) and compare the input against it, but such a list is not easy to find.

Has anyone experimented with this before and found a solution?

Have you tried the tm or qdap packages in R? They offer some text cleaning, but I think it is English-only. Good luck. (It's unlikely that someone has done the same for Dutch; Italian is the same: no luck there.) – Ale, Dec 02 '16 at 15:53
Thank you for the suggestion, Ale. I will take a look at it soon and get back to you. – SHW, Dec 05 '16 at 18:24

score 0 · Answer 1 · answered Dec 03 '16 at 00:12

Try the SnowballC package. It implements word-stemming algorithms for several languages, including Dutch, and includes vocabularies for each language.

library(SnowballC)
# Load the Dutch vocabulary file bundled with the package into the session
load(system.file("words", "dutch.RData", package = "SnowballC"))
voc[[1]] # Dutch words
voc[[2]] # Stemmed Dutch words

Now that you have the vocabularies, you could compute the percentage of words in each open response that match the Dutch vocabulary, and set a threshold below which an answer is filtered out as "bad".
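
A minimal sketch of that matching step in base R might look like the following. The function name dutch_share, the toy vocab vector, and the 0.5 threshold are all hypothetical and only there so the example runs on its own; in practice vocab would be the Dutch word list loaded above (e.g. voc[[1]]), and the cutoff would need tuning on real responses.

```r
# Share of tokens in each answer that appear in a reference vocabulary.
# `vocab` is a tiny stand-in for illustration only; swap in the real
# Dutch word list (lower-cased the same way) when using this on real data.
vocab <- tolower(c("ik", "vind", "het", "een", "goed", "idee", "de", "is"))

dutch_share <- function(answers, vocab) {
  # Lower-case, then split on any run of non-letter characters
  tokens <- strsplit(tolower(answers), "[^[:alpha:]]+")
  vapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]            # drop empty strings left by the split
    if (length(tok) == 0) return(0)    # answers without any words score 0
    mean(tok %in% vocab)               # proportion of tokens found in vocab
  }, numeric(1))
}

answers <- c("Ik vind het een goed idee", "ffff", "fdaljfdlksajf", "het is ffff")
share <- dutch_share(answers, vocab)

threshold <- 0.5                       # hypothetical cutoff; tune on real data
keep <- answers[share >= threshold]    # answers that pass the filter
```

With this toy vocabulary, the first and last answers pass the cutoff and the two keyboard-mash answers are dropped. Matching on exact word forms will miss inflected words that are not in the list, so comparing stemmed tokens against the stemmed vocabulary may work better on real text.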

answered Dec 03 '16 at 00:12 by hoggue

Hello Hoggue. Thank you so much for the suggestion. I didn't have time to look at it right away, but I will certainly do so in the next couple of days and give you feedback on the solution. It sounds promising. – SHW, Dec 05 '16 at 18:23
