
My site is getting larger and it's starting to attract a lot of spam through various channels. The site has many different types of UGC (profiles, forums, blog comments, status updates, private messages, etc.). I have various mitigation efforts underway, which I hope to deploy in a blitzkrieg fashion to convince the spammers that we're not a worthwhile target. I have high confidence in what I'm doing functionality-wise, but one missing piece is killing all the old spam at once.

Here's what I have:

  • Large good/bad corpora (5-figure bad, 6 or 7-figure good). A lot of the spam has very reliable fingerprints, and the fact that I've sort of been ignoring it for 6 months helps :)
  • Large, modular Rails site deployed to AWS. It's not a huge traffic site, but we're running 8 instances with the beginnings of a SOA.
  • Ruby, Redis, Resque, MySQL, Varnish, Nginx, Unicorn, Chef, all on Gentoo

My requirements:

  1. I want it to perform reasonably well given the volume of data (therefore I'm wary of a pure-Ruby solution).
  2. I should be able to train multiple classifiers for different types of content (e.g. 419 scams vs. botnet link spam)
  3. I would like to be able to add manual factors based on our own detective work (pattern matching, IP reuse, etc)
  4. Ultimately I want to construct a nice interface to be used with Ruby. If this requires getting my hands dirty in C or whatever, I can handle it, but I'll avoid it if I can.
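
To make requirements 2 and 3 concrete, here is a minimal sketch of the shape I have in mind: a small naive Bayes classifier that can be instantiated once per content type, wrapped in a scorer that adds weighted manual rules on top of the statistical score. Every name in it (`NaiveBayes`, `SpamScorer`, `BAD_IPS`, the rule weight) is made up for illustration, and it is plain Ruby only to show the interface, not a claim that pure Ruby would be fast enough for the full corpus.

```ruby
require "set"

# Multinomial naive Bayes with add-one (Laplace) smoothing.
class NaiveBayes
  def initialize
    @token_counts = Hash.new { |h, label| h[label] = Hash.new(0) }
    @doc_counts = Hash.new(0)
    @vocabulary = Set.new
  end

  # Record one labelled document (label is e.g. :spam or :ham).
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |token|
      @token_counts[label][token] += 1
      @vocabulary << token
    end
  end

  # Returns { label => unnormalized log-probability } for the text.
  def scores(text)
    total_docs = @doc_counts.values.sum.to_f
    tokens = tokenize(text)
    @doc_counts.each_with_object({}) do |(label, docs), result|
      denominator = @token_counts[label].values.sum + @vocabulary.size
      score = Math.log(docs / total_docs)
      tokens.each do |token|
        score += Math.log((@token_counts[label][token] + 1.0) / denominator)
      end
      result[label] = score
    end
  end

  def classify(text)
    scores(text).max_by { |_label, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z0-9']+/)
  end
end

# Combines a classifier's spam-vs-ham log-odds with weighted manual rules.
class SpamScorer
  Rule = Struct.new(:name, :weight, :check)

  def initialize(classifier, threshold: 0.0)
    @classifier = classifier
    @threshold = threshold
    @rules = []
  end

  # The block receives (text, meta) and returns true when the rule fires.
  def add_rule(name, weight, &check)
    @rules << Rule.new(name, weight, check)
  end

  def score(text, meta = {})
    s = @classifier.scores(text)
    log_odds = s.fetch(:spam) - s.fetch(:ham)
    log_odds + @rules.sum { |rule| rule.check.call(text, meta) ? rule.weight : 0.0 }
  end

  def spam?(text, meta = {})
    score(text, meta) > @threshold
  end
end

# Hypothetical blocklist produced by our own detective work.
BAD_IPS = Set["203.0.113.7"]

# One classifier per spam family; only the link-spam one is shown here.
link_spam = NaiveBayes.new
link_spam.train(:spam, "cheap pills buy now http://example.test")
link_spam.train(:spam, "buy cheap watches now")
link_spam.train(:ham, "thanks for the helpful forum post")
link_spam.train(:ham, "see you at the meetup tomorrow")

scorer = SpamScorer.new(link_spam)
scorer.add_rule(:known_bad_ip, 5.0) { |_text, meta| BAD_IPS.include?(meta[:ip]) }
```

A second instance trained on 419-scam samples would sit alongside `link_spam`, and each content type could pick which scorers to run. The open question is what to put behind this interface so that it holds up at volume.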

I realize this is a long and vague question, but what I'm looking for primarily is just a list of good packages, and secondarily any random thoughts from someone who has built a similar system about ways to approach it.

gtd

1 Answer


We looked for an acceptable open source solution and didn't find one.

If you come to the same conclusion and decide to consider proprietary anti-spam, check out Akismet, a paid collaborative spam filtering service. We've had decent results from it across a dozen medium-sized sites. It integrates with Rails through Rack and rackismet.
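
If you want to gauge it against your existing corpus before wiring in any middleware, you can call Akismet's REST `comment-check` endpoint directly. The sketch below uses only Ruby's standard library and is not rackismet's API; the helper names are made up, and the endpoint and parameter names are as I understand Akismet's documentation, so verify them there before relying on this.

```ruby
require "net/http"
require "uri"

# Builds (but does not send) a comment-check request.
# Helper name and keyword arguments are illustrative, not part of any gem.
def akismet_check_request(api_key:, site_url:, user_ip:, user_agent:, content:)
  uri = URI("https://#{api_key}.rest.akismet.com/1.1/comment-check")
  request = Net::HTTP::Post.new(uri)
  request.set_form_data(
    "blog" => site_url,
    "user_ip" => user_ip,
    "user_agent" => user_agent,
    "comment_type" => "comment",
    "comment_content" => content
  )
  [uri, request]
end

# Sends the request; Akismet answers with a body of "true" for spam.
def akismet_spam?(uri, request)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  response.body.to_s.strip == "true"
end
```

Calling `akismet_check_request` for each stored item and passing the result to `akismet_spam?` from a background job would give you a rough accuracy and latency figure for your own data.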

Mori
  • Definitely one thing I considered. I question the performance and relative cost of it, especially considering the different corpora I want to train for different purposes. Maybe I'm misguided, but I'm gonna keep looking to roll my own for the moment... – gtd Jun 06 '11 at 20:03