My site is getting larger and it's starting to attract a lot of spam through various channels. The site has many different types of UGC (profiles, forums, blog comments, status updates, private messages, etc.). I have various mitigation efforts underway, which I hope to deploy in a blitzkrieg fashion to convince the spammers that we're not a worthwhile target. I have high confidence in what I'm doing functionality-wise, but one missing piece is killing all the old spam at once.
Here's what I have:
- Large good/bad corpora (5-figure bad, 6 or 7-figure good). A lot of the spam has very reliable fingerprints, and the fact that I've sort of been ignoring it for 6 months helps :)
- Large, modular Rails site deployed to AWS. It's not a huge traffic site, but we're running 8 instances with the beginnings of a SOA.
- Ruby, Redis, Resque, MySQL, Varnish, Nginx, Unicorn, Chef, all on Gentoo
My requirements:
- I want it to perform reasonably well given the volume of data (so I'm wary of a pure-Ruby solution).
- I should be able to train it on multiple classifications for different types of content (e.g. 419 scams vs. botnet link spam)
- I would like to be able to add manual factors based on our own detective work (pattern matching, IP reuse, etc)
- Ultimately I want to construct a nice interface to be used with Ruby. If this requires getting my hands dirty in C or whatever, I can handle it, but I'll avoid it if I can.
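To make the requirements above a bit more concrete, here is a rough sketch of the kind of Ruby interface I have in mind: a toy naive Bayes scorer with several categories plus manually weighted rules. Every name in it (`SpamClassifier`, `train`, `add_rule`, `classify`, the `:ip_reused` key) is hypothetical and made up for illustration. It is pure Ruby and not meant to scale; I'd expect a real package to replace the internals.

```ruby
# Hypothetical sketch only: a toy multinomial naive Bayes with add-one
# smoothing, plus manual rules that add a fixed weight to a category's
# log score. Not tuned, not fast, not from any existing library.
class SpamClassifier
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => word => count
    @doc_counts  = Hash.new(0)                            # category => training docs seen
    @rules       = []                                     # [category, weight, predicate]
  end

  # Train one document into a category such as :ham, :scam_419, :link_spam.
  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each { |word| @word_counts[category][word] += 1 }
  end

  # Manual factor: when the block returns true for (text, meta),
  # add `weight` to that category's log score.
  def add_rule(category, weight, &predicate)
    @rules << [category, weight, predicate]
  end

  # Returns a hash of category => log score (higher means more likely).
  def scores(text, meta = {})
    words      = tokenize(text)
    vocab_size = @word_counts.values.flat_map(&:keys).uniq.size
    total_docs = @doc_counts.values.sum.to_f

    @doc_counts.keys.each_with_object({}) do |category, out|
      total_words = @word_counts[category].values.sum
      score = Math.log(@doc_counts[category] / total_docs)
      words.each do |word|
        count = @word_counts[category][word]
        score += Math.log((count + 1.0) / (total_words + vocab_size))
      end
      @rules.each do |rule_category, weight, predicate|
        score += weight if rule_category == category && predicate.call(text, meta)
      end
      out[category] = score
    end
  end

  # Picks the category with the highest score.
  def classify(text, meta = {})
    scores(text, meta).max_by { |_category, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z0-9']+/)
  end
end

classifier = SpamClassifier.new
classifier.train(:ham, "thanks for the great post about rails deployment")
classifier.train(:scam_419, "dear friend i am a prince and need to transfer money urgently")
classifier.train(:link_spam, "cheap pills buy now click here http example")

# Manual factor from our own detective work, e.g. a known-bad IP seen again.
classifier.add_rule(:link_spam, 5.0) { |_text, meta| meta[:ip_reused] }

puts classifier.classify("buy cheap pills now")
puts classifier.classify("thanks for the post", { ip_reused: true })
```

The point is the shape of the API rather than the algorithm: named categories, a way to bolt on hand-written signals, and a single `classify` call I could run from a Resque job over the old content.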
I realize this is a long and vague question, but what I'm looking for primarily is just a list of good packages, and secondarily any random thoughts from someone who has built a similar system about ways to approach it.