3

What is the current state of the art in Spam Prevention techniques?

I've already read Paul Graham's articles about Bayesian filtering (A Plan for Spam and Better Bayesian Filtering) and wanted to know whether any more up-to-date articles are available, preferably AI-related ones.

Zahra E
  • 6
    Spam *prevention*? That requires better child rearing. – Kaz Apr 09 '12 at 07:06
  • 1
    Filtering of any kind sucks because it requires your SMTP server to accept (i.e. deliver) the spam. The problem with accepting spam is that false positives sit in your spam folder without further action being taken. Someone is waiting for you to reply to something you have not seen, and no non-delivery notice is generated to alert them to this. If you *do* look in your spam folder, the spammers have won. You have not achieved anything; you're still scanning your e-mail, picking apart spam from nonspam, just in a different folder. – Kaz Apr 09 '12 at 07:09
  • Looking for theoretical solutions, you'd probably be better off asking the [computer science stackexchange](http://cs.stackexchange.com/). – Nikana Reklawyks Oct 22 '12 at 18:42

4 Answers

4

If you are only trying to block gibberish words and sentences, such as "fasdhusdhfi", you could keep a database of words and their synonyms and raise a flag whenever fewer than 50% of the words in the input are known to it. You could build an offline database, which I wouldn't recommend, or use an online one. For a list of words, I would suggest

http://thesaurus.com/

For a list of synonyms of those words, I would suggest

http://www.synonyms.net/

I think these two would probably be the best for this purpose, as they both have an API (for synonyms.net it's on this page) you can use, so you don't need to parse the returned pages for words.
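As a rough illustration of the 50% check, here is a minimal Python sketch. It assumes you already have the word list in memory as a set; the names `known_word_ratio`, `should_flag` and the tiny `VOCAB` are made up for the example and have nothing to do with either site's API.

```python
import re

def known_word_ratio(text, vocabulary):
    """Return the fraction of words in `text` that appear in `vocabulary`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words)

def should_flag(text, vocabulary, threshold=0.5):
    """Flag the input when fewer than `threshold` of its words are known."""
    return known_word_ratio(text, vocabulary) < threshold

# Hypothetical tiny vocabulary; a real one would be loaded from a word list.
VOCAB = {"hello", "how", "are", "you", "today"}

print(should_flag("hello how are you", VOCAB))        # every word is known
print(should_flag("fasdhusdhfi qwrtp hello", VOCAB))  # only 1 of 3 is known
```

Note that an empty input has a ratio of 0 and so gets flagged; whether that is what you want depends on where the check sits.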

You could then combine this with other methods, such as Bayesian filtering.

While this does not really fit your AI needs, it does block a range of messages.

To fit your 'AI' request, you could probably adapt ALICE's Spam.aiml. It is in AIML format, but contains a lot of permutations of 4-symbol spam. The problem with this is that it is slow.

A possible alternative to Spam.aiml would be to use the rules of the English language to detect spam, and filter it. The following rules could be used:

Every word must have at least one vowel. For this, the letter ‘Y’ is considered a vowel.

No word has more than 3 consonants in a row. For this purpose, ‘TH’ is considered one letter (so as to not mess up on words like 'streNGTH').

No word is longer than 34 letters. The exceptions to this would be the words listed here.

Some letter combinations cannot occur. An example of this would be that the letters ‘J’ and ‘K’ almost never appear directly beside each other in a regular, non-slang conversation.

You could have a database of impossible combinations. I made a small one by running every 2-letter permutation against a database containing 6578 words, and came up with these results:

df bf kf gf jk kj sj fj gj hj lj sl

Those are all impossible combinations. Doubled letters such as 'zz' are not included in that list; they are:

aa bb cc dd ee ff gg hh ii jj kk ll mm nn pp qq rr ss tt uu vv ww xx yy zz

'oo' is omitted, as it appears in many words, such as 'look'.
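For anyone who wants to rebuild such a list from their own word list, this is roughly how it can be done in Python. It is only a sketch: `unseen_bigrams` is a name I made up, and the three-word sample stands in for a real word list. With a small list, a lot of perfectly valid pairs will come out as "impossible", so the result is only as good as the words you feed it.

```python
from itertools import product
from string import ascii_lowercase

def unseen_bigrams(words):
    """Return every two-letter pair that never occurs adjacently in `words`."""
    seen = set()
    for word in words:
        w = word.lower()
        for i in range(len(w) - 1):
            seen.add(w[i:i + 2])
    # All 26 * 26 ordered pairs, doubled letters included.
    all_pairs = {a + b for a, b in product(ascii_lowercase, repeat=2)}
    return all_pairs - seen

# Stand-in for a real word list (assumption for the example).
sample = ["look", "strength", "pizza"]
missing = unseen_bigrams(sample)
print("oo" in missing, "zz" in missing, "jk" in missing)
```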

Segments of the string that are 2 or more characters long and repeat consecutively would be flagged as spam. In the string 'lololololol', the repeated segment is 'lo', and is flagged as spam.

More than 3 of the same vowel in the same word would be flagged as spam. For example, 'oooouuuu' would be flagged as spam, as 'o' and 'u' are vowels that are each repeated more than 3 times.

No word longer than 1 character may be made up of just vowels. In this case, 'Y' would not be considered a vowel, so as to avoid a false positive on 'you'.

Any input in which 15% or more of the words break these rules (a margin for misspellings) would be redirected to spam.
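To make the rules concrete, here is a minimal Python sketch of a checker along these lines. It is not the program mentioned further down, just an illustration, and it makes a few assumptions of its own: 'y' counts as a vowel for the consonant-run rule, the repeated-vowel rule is read as a run of 4 or more, a segment has to repeat at least 3 times in a row (so that 'banana' is not flagged), and 'sl' is dropped from the pair list because it occurs in common words like 'slow'. All function names are made up.

```python
import re

VOWELS = set("aeiou")
# Pairs from the list above, minus 'sl' (it occurs in 'slow', 'sleep').
BAD_PAIRS = {"df", "bf", "kf", "gf", "jk", "kj", "sj", "fj", "gj", "hj", "lj"}

def consonant_run_too_long(word, limit=3):
    # Collapse 'th' to one letter so 'strength' passes.
    collapsed = word.replace("th", "t")
    run = 0
    for ch in collapsed:
        if ch not in VOWELS and ch != "y":  # assumption: 'y' breaks a run
            run += 1
            if run > limit:
                return True
        else:
            run = 0
    return False

def rule_violation(word):
    """Return a short reason if `word` breaks a rule, else None."""
    w = word.lower()
    if len(w) > 34:
        return "too long"
    if not any(ch in VOWELS or ch == "y" for ch in w):
        return "no vowel"
    if len(w) > 1 and all(ch in VOWELS for ch in w):
        return "only vowels"
    if consonant_run_too_long(w):
        return "consonant run"
    if any(w[i:i + 2] in BAD_PAIRS for i in range(len(w) - 1)):
        return "invalid pair"
    if re.search(r"([aeiou])\1{3,}", w):    # same vowel 4+ times in a row
        return "repeated vowel"
    if re.search(r"(.{2,}?)\1{2,}", w):     # segment repeated 3+ times
        return "repeated segment"
    return None

def is_spam(text, tolerance=0.15):
    """Spam if at least `tolerance` of the words break a rule."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return False
    bad = sum(1 for w in words if rule_violation(w) is not None)
    return bad / len(words) >= tolerance

for sample in ["fdsahjfsd", "lololololol", "strength", "you"]:
    print(sample, "->", rule_violation(sample))
```

The order of the checks decides which reason gets reported when a word breaks several rules at once; it does not change whether the word is flagged.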

If you do decide to modify ALICE's files, you can get a lot of them here. Newer versions may be found at ALICE's Google Code page.

You could also use a spellchecker to help with spam detection. You could run the input against a spellchecker such as PyEnchant (for Python), and read the suggestions. If the input has no suggestions, then it can be safely assumed, in most cases, that it is spam.
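A sketch of that idea in Python is below. The filter itself only needs two callables, so it can run with or without PyEnchant; `build_pyenchant_checker` shows how I would wire in PyEnchant's `Dict.check` and `Dict.suggest`, while `demo_check` and `demo_suggest` are a made-up stand-in dictionary so the example runs without the library. Checking the word with `check` first is my addition; the idea above only looks at the suggestions.

```python
def unrecognized_words(words, check, suggest):
    """Return the words that fail `check` and also get no suggestions."""
    return [w for w in words if not check(w) and not suggest(w)]

def build_pyenchant_checker(lang="en_US"):
    """Return (check, suggest) backed by PyEnchant (needs pyenchant installed)."""
    import enchant  # imported lazily so the rest runs without the library
    dictionary = enchant.Dict(lang)
    return dictionary.check, dictionary.suggest

# Stand-in dictionary for the demo (assumption, not a real spellchecker).
KNOWN = {"hello", "world"}

def demo_check(word):
    return word in KNOWN

def demo_suggest(word):
    # Very crude: same first letter and a similar length.
    return [k for k in KNOWN if k[0] == word[0] and abs(len(k) - len(word)) <= 1]

print(unrecognized_words(["hello", "helo", "fdsahjfsd"], demo_check, demo_suggest))
```

With PyEnchant installed you would call `check, suggest = build_pyenchant_checker()` and pass those two in instead of the demo functions.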

It's not perfect, but it does work to a limited extent. I made a small program to demonstrate what spam filtering like this would result in. This is the output:

>>> fdsahjfsd
'fdsahjfsd' is spam since more than 3 consonants appear in a row
>>> fhsdjhfksd
'fhsdjhfksd' is spam since it has no vowel
>>> jfsdkjl
'jfsdkjl' is spam since it has no vowel
>>> dk
'dk' is spam since it has no vowel
>>> ddds
'ddds' is spam since it has no vowel
>>> uxxs
'uxxs' is not spam
>>> kd
'kd' is spam since it has no vowel
>>> ukd
'ukd' is not spam
>>> asdjaskljlaskjldkasjkljdklas
'asdjaskljlaskjldkasjkljdklas' is spam since it is too long
>>> hdjaskj
'hdjaskj' is spam since invalid sequences detected

As I said before, it's not perfect, as it returns false negatives (such as 'uxxs'), but this could be fixed with a spell checking implementation.

The drawback of a spell checking implementation would be that your spam detection would depend on the number of words the dictionary has. Most spellcheckers only have the first 10,000 words, so some uncommon words may be blocked as spam. However, checking if over 15% of the input is invalid could solve this.

If you think it may help you, you can get the small program I made from here. It's written in Python.

Also, as other answers here have said, a 'state-of-the-art' spam filter would require a mixture of methods.

You can use SpamAssassin, Pyzor, Reverend, and Orange, but probably the best thing to do would be to try to combine all of those together.

If you would like to use Lisp for this, a nice article about Bayesian filtering in Lisp is located here.

If you would like to do this via a neural network, then this CodeProject article may be useful. It uses a simple, easy-to-use DLL, and the example code can be used almost directly for the task of spam filtering.

Cœur
Xyene
  • I didn't downvote, but it seems like there is no real "answer" in your post; it is more a laundry list of suggestions. – Woot4Moo Apr 11 '12 at 23:54
  • True, but the question itself is a tad vague. No 'preferred' programming language is stated, so I gave a list of implementations in different languages, and it is not stated whether it is about spam email detection or spam string detection. Once again, no 'preferred' implementation is stated either, so I put out a couple of online databases. Therefore I tried giving as many suggestions, as you called them, as I could, based on what the question was. If the question was indeed asking about SPAM EMAIL detection, then my answer, or laundry list of suggestions, is useless, aside from the final links. – Xyene Apr 12 '12 at 00:20
  • Sorry if my question is vague. I didn't mention "spam email detection" or anything else because, in my Googling, I found no such classification in descriptions of the related techniques. Actually, I was looking for theoretical solutions that apply generally to different kinds of spam (implementation is not the point, let alone the programming language). And of course I wasn't the one who downvoted you :) – Zahra E Apr 12 '12 at 09:18
3

The state of the art is not so much any particular algorithm as the quality and amount of input data. To reach for the state of the art, you need hundreds of thousands of active users and millions of messages per day. In other words, be Gmail, Yahoo, or Hotmail, or have the means to obtain similarly massive amounts of real-time data.

Save your verdict until the last possible moment; be prepared to pull a message out of the user's inbox just before they request a message listing. Figure out which users to trust, and apply their verdicts to the messages of all other users. Collect as many external inputs as you can (user verdicts, sender reputation, URL destination analysis, what have you), and feed them into your machine learning machinery.
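To give a feel for "trust some users more than others", here is a toy Python sketch that folds user verdicts and a sender reputation into one score with a trust-weighted average. It is purely illustrative: the function, the weights and the user names are invented, and it is not a description of what any large mail provider actually runs.

```python
def spam_score(verdicts, trust, sender_reputation, prior_weight=1.0):
    """Combine user verdicts and sender reputation into a score in [0, 1].

    verdicts: list of (user_id, is_spam) pairs
    trust: dict mapping user_id to a weight in [0, 1]; unknown users count as 0
    sender_reputation: prior estimate (0..1) that this sender sends spam
    """
    spam_weight = sender_reputation * prior_weight
    total_weight = prior_weight
    for user, is_spam in verdicts:
        weight = trust.get(user, 0.0)
        total_weight += weight
        if is_spam:
            spam_weight += weight
    return spam_weight / total_weight

# Hypothetical data: 'mallory' has no track record, so her vote carries no weight.
trust = {"alice": 0.9, "bob": 0.2}
verdicts = [("alice", True), ("bob", False), ("mallory", True)]
print(spam_score(verdicts, trust, sender_reputation=0.5))
```

With no verdicts at all the score is just the sender reputation, and it moves as trusted users weigh in, which is the "save your verdict until the last possible moment" part: you can recompute it right up until the message is shown.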

Trying to filter spam based on message contents alone is a losing game; the spammers know how to mutate their messages to the point where a Bayesian classifier can barely see anything but noise. But you can use this against them. SpamAssassin has many proofs of this, but again, you need dynamic analysis of real-time data to really pull it off. I would even claim that once you have enough relevant inputs, the precise method you use for formulating a verdict is of secondary importance.

tripleee
1

I had been (out of sheer laziness) rolling with SpamAssassin's Bayes implementation for a while, and it had been performing rather poorly.

A few months back, I added collaborative filtering systems Vipul's Razor and Pyzor to my arsenal, with SpamAssassin in control, raising the spam scores. I feed my spams to both systems on a semi-regular basis. It's still not perfect, but my phone goes off a lot less frequently now.

It seems "state-of-the-art" is a combination of effective techniques.

Mattie