
Background:

I'm downloading my twitter feed and saving them into text files, and I want to check the likelihood of spam using a Perl script with Mail::SpamAssassin. I followed this post about loading messages from text, but all my messages keep getting marked as 'not spam' in my if statement, even the ones that are 419 scams.

Question:

  • What am I doing wrong?
  • Do I have to configure SpamAssassin's files?
  • Do my messages need to be in a certain format?
  • Is there a better alternative for my project?

Details:

Code:

use strict;
use warnings;
use Mail::SpamAssassin;

# Perl's open() does not expand "~", so build the path from $HOME.
my $path = "$ENV{HOME}/Messages/twitter_tweet.ema";
open my $fh, '<', $path or die "Cannot open $path: $!";
my @lines = <$fh>;
close $fh;

my $spamtest = Mail::SpamAssassin->new();
my $mail     = $spamtest->parse(\@lines);
my $status   = $spamtest->check($mail);
print $status->get_report();

if ($status->is_spam()) {
    print "Totally Spam\n";
} else {
    print "not spam\n";
}

# Release the objects once the verdict has been read.
$status->finish();
$mail->finish();
$spamtest->finish();

Output:

(no report template found)

not spam

Notes:

I didn't configure SpamAssassin; I simply started using the Perl module.

There is a file called ~/.spamassassin/user_prefs in my home directory, but I didn't touch it.

Shabbir Hussain
  • `open FILE, "<", ~/Messages/twitter_tweet.ema' or die;` is this a copy error or is your script also lacking the opening `'` around the filename? – RobEarl Jul 05 '13 at 12:42

1 Answer


I wrote the response below before noticing how your question begins. The "I'm downloading my twitter feed and saving them into text files" piece is key. Very key. Specifically, SpamAssassin is designed to scan email, complete with the rich metadata in its headers. Twitter feeds do not have headers.

The best spam-fighting techniques I've seen for Twitter, which are mostly academic research rather than usable code, involve intense link graphs that track followers and build reputations for each user. This is pretty much the only metadata available in Twitter, and it is not something SpamAssassin looks at, so SpamAssassin has nothing to go on but the tweet ("body") content itself.

Sure, the Bayesian mechanism can conceivably help, though again it is built around headers and email-specific tokenization techniques. So too can the URI DNSBLs, but the other lookups (Razor2, Pyzor, all other DNSBLs) are useless, as are 99% or so of the regex rule signatures. (Also note that many online indices are tuned for live lookups and therefore expire older entries, so if you scan a spam from a few days ago, it might no longer have an entry even if it once did.)

You're far better off using some content-only spam filter. If you have a large enough collection of messages, you can train a Bayesian based filter on a subset and then run it on the rest. If it's an ongoing effort, correct its mistakes as you find them and it should improve to something usable over time.
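To give a rough idea of what a content-only Bayesian filter does, here is a minimal sketch of a naive Bayes classifier in plain Perl. It is a toy, not a recommendation of a particular tool: the function names (`tokens`, `train`, `score`, `classify`) and the four training tweets are made up for illustration, and a usable filter would need a far larger training set.

```perl
use strict;
use warnings;

# Toy content-only naive Bayes filter. All names and training data below
# are invented for illustration.
my %count = (spam => {}, ham => {});   # per-class token counts
my %total = (spam => 0,  ham => 0);    # tokens seen per class
my %docs  = (spam => 0,  ham => 0);    # messages seen per class
my %vocab;                             # every token seen during training

# Lowercase the text and split on anything that is not a letter, digit
# or apostrophe.
sub tokens {
    my ($text) = @_;
    return grep { length } split /[^a-z0-9']+/, lc $text;
}

# Record one labelled message ($class is 'spam' or 'ham').
sub train {
    my ($class, $text) = @_;
    $docs{$class}++;
    for my $tok (tokens($text)) {
        $count{$class}{$tok}++;
        $total{$class}++;
        $vocab{$tok} = 1;
    }
    return;
}

# Log-probability of the class given the text, with add-one smoothing.
sub score {
    my ($class, $text) = @_;
    my $v  = keys %vocab;
    my $lp = log($docs{$class} / ($docs{spam} + $docs{ham}));
    for my $tok (tokens($text)) {
        my $c = $count{$class}{$tok} // 0;
        $lp += log(($c + 1) / ($total{$class} + $v));
    }
    return $lp;
}

sub classify {
    my ($text) = @_;
    return score('spam', $text) > score('ham', $text) ? 'spam' : 'ham';
}

train(spam => 'Dear friend, I need your help to transfer 10 million dollars');
train(spam => 'You won the lottery, claim your prize money now');
train(ham  => 'Heading to the Perl meetup tonight, anyone else going?');
train(ham  => 'New blog post about parsing text files in Perl');

print classify('claim your million dollars prize'), "\n";
print classify('anyone going to the meetup tonight'), "\n";
```

Correcting mistakes then amounts to calling `train` again with the right label for each message the filter got wrong.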

If you really really want to use SpamAssassin, read the rest of this answer. Keep in mind that I wrote it assuming you've got real RFC 5322 (originally RFC 822) email.
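For reference, an RFC 5322 message is a block of header lines, a blank line, and then the body. The sketch below wraps one tweet in that shape so that `Mail::SpamAssassin->parse` would at least see a well-formed message. The helper name `tweet_to_message`, the `.invalid` addresses and the sample tweet are all placeholders I made up, and invented headers give SpamAssassin no real metadata to work with, so this only fixes the format, not the efficacy.

```perl
use strict;
use warnings;

my @DAY = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Wrap one tweet in a minimal RFC 5322 message: header lines, a blank
# line, then the body. Every header value is an invented placeholder.
sub tweet_to_message {
    my ($screen_name, $text, $epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime $epoch;
    my $date = sprintf '%s, %02d %s %04d %02d:%02d:%02d +0000',
        $DAY[$wday], $mday, $MON[$mon], $year + 1900, $hour, $min, $sec;
    return join "\n",
        "From: $screen_name <$screen_name\@twitter.invalid>",
        'To: me@example.invalid',
        "Subject: Tweet from $screen_name",
        "Date: $date",
        "Message-ID: <$epoch.$screen_name\@twitter.invalid>",
        '',        # blank line separates headers from body
        $text,
        '';        # ensures a trailing newline
}

# Arbitrary sample values, used only for demonstration.
print tweet_to_message('some_user', 'Hello from my feed', 1373028120);
```

The returned string can be passed to `parse` in place of the array of lines read from a file.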


There are two possibilities: You have an invocation problem (SpamAssassin does not run properly or you are not properly extracting the verdict) or you have an efficacy problem (SpamAssassin runs but does not have the desired accuracy, in this case, a false negative problem).

Here is the GTUBE test string:

XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X

To tell the two apart, add the above GTUBE test string to a test message (copy a real message and include that string in the body) and then try running your code again.

  • If the GTUBE message is not flagged as spam, you have an invocation problem.
  • If the GTUBE message is flagged but your 419 is not, you have an efficacy problem.
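A minimal sketch of that GTUBE check, reusing the same `parse`, `check` and `is_spam` calls as the code in the question. The helper `gtube_message`, the placeholder headers and the `SA_DEBUG` environment variable are my own inventions for this example; `debug` is a documented option of `Mail::SpamAssassin->new` and sends diagnostics to STDERR.

```perl
use strict;
use warnings;

# The GTUBE test string, exactly as given above.
my $GTUBE = 'XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X';

# Build a small test message; the headers are placeholders.
sub gtube_message {
    return join "\n",
        'From: sender@example.invalid',
        'To: recipient@example.invalid',
        'Subject: GTUBE test',
        '',
        'This body contains the GTUBE test string:',
        $GTUBE,
        '';
}

# Run the check only if the module is installed.
if (eval { require Mail::SpamAssassin; 1 }) {
    # SA_DEBUG is a hypothetical switch for this sketch: set it to 1 to
    # get verbose diagnostics.
    my $spamtest = Mail::SpamAssassin->new(
        { debug => $ENV{SA_DEBUG} ? 'all' : 0 });
    my $mail   = $spamtest->parse(gtube_message());
    my $status = $spamtest->check($mail);

    print $status->is_spam()
        ? "GTUBE flagged: invocation works\n"
        : "GTUBE not flagged: invocation problem\n";

    $status->finish();
    $mail->finish();
    $spamtest->finish();
} else {
    print "Mail::SpamAssassin is not installed\n";
}
```

If this reports an invocation problem, the debug output is the first place to look, for example to see whether any rule files were found and loaded.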

Invocation problems: Try enabling debug. Paste your output here. I'd need more clues to diagnose that kind of an issue.

Efficacy problems: You can radically improve SpamAssassin's results by ensuring you have blocklists (DNSBLs and URI DNSBLs) and networked plugins (e.g. Razor, Pyzor) and that you are actively training Bayes (which takes 200+ spams and 200+ hams). There are also good tips on the spamtips.org ultimate setup guide.

If you want further help on a particular spam example, you'll have to post the message, with limited redactions if possible, some place that will leave it intact, e.g. Pastebin.com (if it's short enough, you could paste it here at StackOverflow, but most spam is not short).

Adam Katz