I'd like to collect some kind of geographical information from website users: for a given set of data they will mark a checkbox indicating whether a place has or does not have a given property. Are there any tools/frameworks for detecting fraud or spam submissions based on the whole collected data set (and possibly other info)? I'd like to get filtered, more reliable data.
- There are some services/tools/frameworks for existing crowdsourcing platforms, like Amazon Mechanical Turk (most, BTW, are non-free). Are you interested in those, or would you like pointers on how to do it yourself? – etov Aug 28 '11 at 14:58
- @etov - I'm thinking about extracting the "truth" from the gathered votes, assuming fraudulent votes are a minority and can be statistically distinguished – tomash Aug 29 '11 at 12:31
1 Answer
Not sure if that's exactly what you're asking for, but here are some tips from my experience using Amazon Mechanical Turk:
There are several academic papers dealing with such problems; here is a good one. In addition, based on the following general recommendations, I've created a custom procedure that worked on my data:
a. Include an open question, and filter out cases where it wasn't answered. Such a question is harder to answer automatically, and it might also be more time-consuming, and thus less attractive, for a fraudster.
b. If possible, don't use a binary scale (i.e. a checkbox), but a graded scale (e.g. 1-4 or 1-6). This will give you more data to work with.
c. If available, filter out cases where the time spent filling in your form was too short (this is especially useful if you include that open question).
d. If you have multiple inputs per user, check for repetitive answers, and for users who consistently give far-from-average answers. If each user submits only a single "form", consider putting more than a single element/question in it, so you'll get multiple submissions per user.
e. If you have only a single submission per user or user-id, your options are more limited. I can suggest filtering out outliers (e.g. data points farther than 3 standard deviations from the average), provided you have enough data.
f. After all the filtering, check the agreement or disagreement in your data, e.g. by checking what proportion of your data points falls within x standard deviations of the average (see the sketch below). In case of agreement, use the average; in case of disagreement, collect some more data.
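To make (e) and (f) a bit more concrete, here's a minimal Python sketch of one way I'd do it; the ratings list, the thresholds (3 SD for outliers, 80% agreement), and the function name are just illustrative assumptions, not a prescribed method:

```python
# Minimal sketch of steps (e) and (f): drop outliers beyond 3 standard
# deviations, then check whether the remaining ratings agree closely
# enough to trust their average. Thresholds are arbitrary examples.
from statistics import mean, stdev

def filter_and_aggregate(ratings, outlier_sd=3.0, agreement_sd=1.0, min_share=0.8):
    """Return (average, agreed) for one item's ratings; (None, False) if too few."""
    if len(ratings) < 3:
        return None, False  # not enough data to filter meaningfully

    mu, sd = mean(ratings), stdev(ratings)
    # (e) drop ratings farther than outlier_sd standard deviations from the mean
    kept = [r for r in ratings if abs(r - mu) <= outlier_sd * sd]

    mu2 = mean(kept)
    sd2 = stdev(kept) if len(kept) > 1 else 0.0
    # (f) agreement check: what share of the kept ratings falls within
    # agreement_sd standard deviations of the new mean?
    share = sum(abs(r - mu2) <= agreement_sd * sd2 for r in kept) / len(kept)
    return mu2, share >= min_share

# Example: ratings on a 1-6 scale for a single place/question.
# Nothing here is beyond 3 SD, but only 5 of 7 ratings fall within 1 SD
# of the mean, so agreement comes back False.
print(filter_and_aggregate([4, 5, 4, 4, 6, 1, 4]))
```

Run something like this per place/question; wherever it reports disagreement is where you'd collect more data (or look more closely for fraud), as in (f).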
Hope it helps,

- I was thinking about custom data gathering and filtering (not using MTurk), but all this advice is very valuable as well, thanks! – tomash Sep 14 '11 at 08:19