What is a good open source package for building flexible spam detection on a large Rails site?

Question

My site is getting larger and it's starting to attract a lot of spam through various channels. The site has a lot of different types of UGC (profiles, forums, blog comments, status updates, private messages, etc.). I have various mitigation efforts underway, which I hope to deploy in a blitzkrieg fashion to convince the spammers that we're not a worthwhile target. I have high confidence in what I'm doing functionality-wise, but one missing piece is killing all the old spam at once.

Here's what I have:

Large good/bad corpora (5-figure bad, 6 or 7-figure good). A lot of the spam has very reliable fingerprints, and the fact that I've sort of been ignoring it for 6 months helps :)
Large, modular Rails site deployed to AWS. It's not a huge traffic site, but we're running 8 instances with the beginnings of a SOA.
Ruby, Redis, Resque, MySQL, Varnish, Nginx, Unicorn, Chef, all on Gentoo

My requirements:

I want it to perform reasonably well given the volume of data (therefore I'm wary of a pure Ruby solution).
I should be able to train multiple classifiers for different types of content (419 scams vs. botnet link spam).
I would like to be able to add manual factors based on our own detective work (pattern matching, IP reuse, etc)
Ultimately I want to construct a nice interface to be used with Ruby. If this requires getting my hands dirty in C or whatever, I can handle it, but I'll avoid it if I can.
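
To make the requirements above concrete, here is a minimal sketch of the kind of Ruby interface being described: one small naive Bayes model per spam category, plus hand-written rules that add to the score. Everything in it (the `SpamScorer` class, its method names, the rule weights, the sample IP) is hypothetical and only illustrates the shape of the API. It is pure Ruby, ignores class priors, and is not tuned for the data volumes mentioned, so a real system would likely delegate the scoring to something faster.

```ruby
# Hypothetical sketch: per-category naive Bayes plus manual rule factors.
# Not a real library; all names here are illustrative assumptions.
class SpamScorer
  def initialize
    # category => { spam: token counts, ham: token counts }
    @counts = Hash.new { |h, k| h[k] = { spam: Hash.new(0), ham: Hash.new(0) } }
    # category => total tokens seen per label
    @totals = Hash.new { |h, k| h[k] = { spam: 0, ham: 0 } }
    @rules  = []
  end

  # Train one category (e.g. :link_spam) with a labelled example.
  def train(category, label, text)
    tokenize(text).each do |token|
      @counts[category][label][token] += 1
      @totals[category][label] += 1
    end
  end

  # Register a manual factor: the block returns true when the rule fires,
  # and `weight` is then added to the log-odds score.
  def add_rule(weight, &rule)
    @rules << [weight, rule]
  end

  # Log-odds that `text` is spam for `category`; positive means spam-like.
  # Uses add-one (Laplace) smoothing and no class priors.
  def score(category, text, meta = {})
    spam = @counts[category][:spam]
    ham  = @counts[category][:ham]
    vocab = [(spam.keys | ham.keys).size, 1].max

    text_score = tokenize(text).sum do |token|
      p_spam = (spam[token] + 1.0) / (@totals[category][:spam] + vocab)
      p_ham  = (ham[token] + 1.0) / (@totals[category][:ham] + vocab)
      Math.log(p_spam / p_ham)
    end

    rule_score = @rules.sum { |weight, rule| rule.call(text, meta) ? weight : 0.0 }
    text_score + rule_score
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z0-9']+/)
  end
end

scorer = SpamScorer.new
scorer.train(:link_spam, :spam, "buy cheap pills now")
scorer.train(:link_spam, :ham, "thanks for the helpful post")
# Manual factor from detective work: a known-bad IP (example address).
scorer.add_rule(5.0) { |_text, meta| meta[:ip] == "203.0.113.7" }
```

Calling `scorer.score(:link_spam, "buy cheap pills")` gives a positive value and `scorer.score(:link_spam, "thanks helpful post")` a negative one; passing `ip: "203.0.113.7"` as metadata pushes the second one positive because the manual rule fires.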

I realize this is a long and vague question, but what I'm looking for primarily is just a list of good packages, and secondarily any random thoughts from someone who has built a similar system about ways to approach it.

score 5 · Accepted Answer · answered Jun 03 '11 at 21:58


We looked for an acceptable open source solution and didn't find one.

If you come to the same conclusion and decide to consider proprietary anti-spam, check out Akismet, a paid collaborative spam filtering service. We've had decent performance from it across a dozen medium-sized sites. It integrates with Rails through Rack and rackismet.
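
For a rough idea of what a check against Akismet involves, here is a minimal sketch that calls the service's HTTP API directly with Ruby's standard library rather than going through rackismet, whose interface is not shown here. The helper names `akismet_params` and `akismet_spam?` are made up for illustration, and the endpoint and parameter names are as I recall them from Akismet's public `comment-check` API, so verify them against the current documentation before relying on this.

```ruby
require "net/http"
require "uri"

# Build the form fields for a comment-check call.
# Hypothetical helper; nil values are dropped so optional fields can be omitted.
def akismet_params(site_url:, user_ip:, user_agent:, content:, author: nil, type: "comment")
  {
    "blog"            => site_url,
    "user_ip"         => user_ip,
    "user_agent"      => user_agent,
    "comment_type"    => type,
    "comment_author"  => author,
    "comment_content" => content
  }.compact
end

# Ask Akismet whether the content looks like spam.
# Assumes the key-in-hostname endpoint form; the response body is the
# literal string "true" (spam) or "false" (not spam).
def akismet_spam?(api_key, params)
  uri = URI("https://#{api_key}.rest.akismet.com/1.1/comment-check")
  response = Net::HTTP.post_form(uri, params)
  response.body.strip == "true"
end

params = akismet_params(
  site_url:   "https://example.com",
  user_ip:    "203.0.113.7",
  user_agent: "Mozilla/5.0",
  content:    "buy cheap pills now"
)
# akismet_spam?("your-api-key", params)  # makes a network call; needs a real key
```

Each check is a network round trip, so for a bulk cleanup of old content it would make sense to run these from background jobs (Resque, in the setup described in the question) rather than inline.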


Mori


Definitely one thing I considered. I question the performance and relative cost of it, especially considering the different corpora I want to train for different purposes. Maybe I'm misguided, but I'm gonna keep looking to roll my own for the moment... – gtd Jun 06 '11 at 20:03
