I want to use a regular expression to detect URLs in SpamAssassin.

The following pattern works well in the other tools I use:

http(s)?://([a-zA-Z0-9.])+.[a-zA-Z]{2,3}

However, this does not work in SpamAssassin.

I get the following warnings whenever I use any variation of the regular expression above:

[root@~]spamassassin --lint
Aug 13 19:30:25.005 [38721] warn: Having no space between pattern and following word is deprecated at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 14.
Aug 13 19:30:25.005 [38721] warn: Bareword found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 14, near "var"
Aug 13 19:30:25.005 [38721] warn:  (Missing operator before var?)
Aug 13 19:30:25.005 [38721] warn: Misplaced _ in number at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 14.
Aug 13 19:30:25.005 [38721] warn: Bareword found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 14, near "72_active"
Aug 13 19:30:25.005 [38721] warn:  (Missing operator before active?)
Aug 13 19:30:25.005 [38721] warn: Bareword found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 16, near "", ruletype => "body"
Aug 13 19:30:25.005 [38721] warn:  (Missing operator before body?)
Aug 13 19:30:25.005 [38721] warn: String found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 28, near "body"); 
Aug 13 19:30:25.005 [38721] warn:  last;
Aug 13 19:30:25.005 [38721] warn:  }
Aug 13 19:30:25.005 [38721] warn:  }
Aug 13 19:30:25.005 [38721] warn:  
Aug 13 19:30:25.005 [38721] warn:  
Aug 13 19:30:25.005 [38721] warn:  }
Aug 13 19:30:25.005 [38721] warn:  
Aug 13 19:30:25.005 [38721] warn:  if ($scoresptr->{q{FUZZY_ERECT}}) {
Aug 13 19:30:25.005 [38721] warn:  
Aug 13 19:30:25.005 [38721] warn:  foreach my $l (@_) {
Aug 13 19:30:25.005 [38721] warn:  
Aug 13 19:30:25.005 [38721] warn: #line 1 ""
Aug 13 19:30:25.005 [38721] warn:  (Might be a runaway multi-line "" string starting on line 16)
Aug 13 19:30:25.006 [38721] warn: Having no space between pattern and following word is deprecated at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 28.
Aug 13 19:30:25.006 [38721] warn: Misplaced _ in number at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 28.
Aug 13 19:30:25.006 [38721] warn: Bareword found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 28, near "25_replace"
Aug 13 19:30:25.006 [38721] warn:  (Missing operator before replace?)
Aug 13 19:30:25.006 [38721] warn: Bareword found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 30, near "", ruletype => "body"
Aug 13 19:30:25.006 [38721] warn:  (Missing operator before body?)
Aug 13 19:30:25.006 [38721] warn: String found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 42, near "body"); 
Aug 13 19:30:25.006 [38721] warn:  last;
Aug 13 19:30:25.006 [38721] warn:  }
Aug 13 19:30:25.006 [38721] warn:  }
Aug 13 19:30:25.006 [38721] warn:  
Aug 13 19:30:25.006 [38721] warn:  
Aug 13 19:30:25.006 [38721] warn:  }
Aug 13 19:30:25.006 [38721] warn:  
Aug 13 19:30:25.006 [38721] warn:  if ($scoresptr->{q{MORE_SEX}}) {
Aug 13 19:30:25.006 [38721] warn:  
Aug 13 19:30:25.006 [38721] warn:  foreach my $l (@_) {
Aug 13 19:30:25.006 [38721] warn:  
Aug 13 19:30:25.006 [38721] warn: #line 1 ""
Aug 13 19:30:25.006 [38721] warn:  (Might be a runaway multi-line "" string starting on line 30)
Aug 13 19:30:25.006 [38721] warn: Having no space between pattern and following word is deprecated at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 42.
Aug 13 19:30:25.006 [38721] warn: Misplaced _ in number at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 42.
Aug 13 19:30:25.006 [38721] warn: Bareword found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 42, near "20_phrases"
Aug 13 19:30:25.006 [38721] warn:  (Missing operator before phrases?)
Aug 13 19:30:25.007 [38721] warn: Bareword found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 44, near "", ruletype => "body"
Aug 13 19:30:25.007 [38721] warn:  (Missing operator before body?)
Aug 13 19:30:25.007 [38721] warn: String found where operator expected at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 44, at end of line
Aug 13 19:30:25.007 [38721] warn:  (Missing semicolon on previous line?)
Aug 13 19:30:25.007 [38721] warn: rules: failed to compile Mail::SpamAssassin::Plugin::Check::_body_tests_0_3, skipping:
Aug 13 19:30:25.007 [38721] warn:  (Can't find string terminator '"' anywhere before EOF at /etc/mail/spamassassin/local.cf, rule HAS_LINK, line 44.)
Aug 13 19:30:25.140 [38721] warn: lint: 1 issues detected, please rerun with debug enabled for more information
Michael Currie
skrilled
  • Try this regex instead: `https?:\/\/([a-zA-Z0-9_\-]+\.)?[a-zA-Z0-9_\-]+\.[a-z]{2,3}` (that won't detect every URL, but it will catch the majority). Would you mind posting the config that triggers those warnings? – CarHa Aug 14 '15 at 02:53
  • If you would like to make that an answer I'll select it as best. That ended up working with no errors in SA :) – skrilled Aug 14 '15 at 17:28

1 Answer

This regex detects common URLs:

https?:\/\/([a-zA-Z0-9_\-]+\.)+(mobi|[a-z]{2,3})

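The TLD part only matches mobi or two to three lowercase letters, so it does not fully match URLs with longer generic TLDs such as .info. Because the pattern is unanchored, it still matches the start of such a URL (the first three letters of the TLD), but if you need the whole hostname matched, add those TLDs to the alternation next to mobi.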
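Extending the TLD alternation could look roughly like the sketch below. It uses Python's re module purely for illustration (an assumption; SpamAssassin itself uses Perl regexes, though this pattern only relies on syntax the two share). The EXTRA_TLDS list is hypothetical and should be replaced with whatever TLDs you actually need.

```python
import re

# Pattern from the answer, unchanged.
original = re.compile(r"https?:\/\/([a-zA-Z0-9_\-]+\.)+(mobi|[a-z]{2,3})")

# Hypothetical list of longer TLDs to accept in full; adjust as needed.
EXTRA_TLDS = ["mobi", "info", "online"]

# Build the alternation so the longer TLDs are tried before the
# generic two-to-three-letter fallback.
tld_alternation = "|".join(re.escape(tld) for tld in EXTRA_TLDS)
extended = re.compile(
    r"https?:\/\/([a-zA-Z0-9_\-]+\.)+(" + tld_alternation + r"|[a-z]{2,3})"
)

for url in ["http://example.info", "http://shop.example.online/x"]:
    # Show how much of the URL each pattern actually matches.
    print(url, "->", original.search(url).group(0), "|", extended.search(url).group(0))
```

The longer TLDs are placed before [a-z]{2,3} because regex alternation tries branches left to right; with the fallback first, "info" would be cut short to "inf".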


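As for your regex: a dot has to be escaped (\.) to match a literal dot, because an unescaped dot matches any character. The same applies to other characters with a special meaning in regexes, such as * and ?. The / is not special in the regex itself, but SpamAssassin patterns are written between slashes (/pattern/), so an unescaped / ends the pattern early. The unescaped // after http(s)? is the likely cause of the Perl syntax warnings in your lint output.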
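The effect of the unescaped dot can be checked with a short sketch. Python's re module is used here only for illustration (an assumption; SpamAssassin uses Perl regexes, but both patterns behave the same way for these inputs). The sample strings are made up for the demonstration.

```python
import re

# Pattern from the question: the dot before the TLD is unescaped,
# so it matches any single character, not only a literal ".".
loose = re.compile(r"http(s)?://([a-zA-Z0-9.])+.[a-zA-Z]{2,3}")

# Pattern from the answer: dots are escaped and must be literal.
strict = re.compile(r"https?:\/\/([a-zA-Z0-9_\-]+\.)+(mobi|[a-z]{2,3})")

# Made-up sample inputs; the second one has no dot and no TLD at all.
samples = [
    "http://example.com",
    "http://example",
    "https://mail.example.org/path",
]

for text in samples:
    print(text, "loose:", bool(loose.search(text)), "strict:", bool(strict.search(text)))
```

The string "http://example" contains no dot, yet the loose pattern still matches it, because the unescaped dot is satisfied by an ordinary letter. The strict pattern rejects it.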

https://regex101.com is a good site for testing regexes, and it also explains what each part of a pattern does.

CarHa