
I have read similar questions here, but since not all regular expression engines are created equal, I was not able to find a solution to my problem.

I am working on a rule for SpamAssassin that will tell if the recipient's e-mail username is contained in the body of the message. For example, an e-mail sent to testuser@somedomain.com contains testuser in the body of the message. I have written and tested a regular expression on Regex-101 and am able to match it as expected, but when I create the rule it does not work when I test it in SpamAssassin.

Here is the expression:

/To:\s([a-z0-9][-a-z0-9]{1,19})\@somedomain\.com[a-z0-9\s=;:\/\.-]*\1\b/i

What it should do is match an e-mail address in the To: header (or anywhere in the body of the message matching the format To: user@somedomain.com). As I mentioned before, the expression matches as expected on Regex-101, but when I make a rule in SpamAssassin, it does not match.

If I remove the leading To:\s then it does match, but I am only concerned with matching the e-mail in the To: header. I have tried these various mutations of the expression:

/To:\s([a-z0-9][-a-z0-9]{1,19})\@somedomain\.com[a-z0-9\s=;:\/\.-]*\1\b/i
/To: ([a-z0-9][-a-z0-9]{1,19})\@somedomain\.com[a-z0-9\s=;:\/\.-]*\1\b/i
/To:[\s]{0,2}([a-z0-9][-a-z0-9]{1,19})\@somedomain\.com[a-z0-9\s=;:\/\.-]*\1\b/i
/:\s([a-z0-9][-a-z0-9]{1,19})\@somedomain\.com[a-z0-9\s=;:\/\.-]*\1\b/i

/\s([a-z0-9][-a-z0-9]{1,19})\@somedomain\.com[a-z0-9\s=;:\/\.-]*\1\b/i

None of the previous rules match, but this one does:

/([a-z0-9][-a-z0-9]{1,19})\@somedomain\.com[a-z0-9\s=;:\/\.-]*\1\b/i

Here is the text I am using for testing:

Subject: Test spam mail (GTUBE) private jet rental
Message-ID: <GTUBE1.1010101@example.net>
Date: Wed, 23 Jul 2003 23:30:00 +0200
From: Sender <sender@live.com>
To: recipient@somedomain.com
Precedence: junk
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
recipient
This is the GTUBE, the
    Generic
    Test for
    Unsolicited
    Bulk
    Email

This should match on the To: recipient@somedomain.com .... recipient, but I can only get it to match when I remove the To:\s from the expression. The full expression tests out in Regex-101, so it seems to be something specific to SpamAssassin, but I'm not sure.
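For reference, the Regex-101 result can be reproduced outside SpamAssassin. The following is a minimal sketch using Python's `re` module, on the assumption that it treats these particular constructs the same way a Perl-compatible engine does. It only shows that the expression matches the raw sample text. It says nothing about what text SpamAssassin actually hands to the rule. The names `SAMPLE`, `PATTERN`, and `username` are made up for the illustration.

```python
import re

# Sample message from the question: headers followed by the start of the body.
SAMPLE = """Subject: Test spam mail (GTUBE) private jet rental
Message-ID: <GTUBE1.1010101@example.net>
Date: Wed, 23 Jul 2003 23:30:00 +0200
From: Sender <sender@live.com>
To: recipient@somedomain.com
Precedence: junk
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
recipient
This is the GTUBE, the
"""

# The first expression from the question; the trailing /i becomes re.IGNORECASE.
PATTERN = re.compile(
    r"To:\s([a-z0-9][-a-z0-9]{1,19})\@somedomain\.com[a-z0-9\s=;:\/\.-]*\1\b",
    re.IGNORECASE,
)

match = PATTERN.search(SAMPLE)
# Group 1 is the username captured from the To: line, or None if nothing matched.
username = match.group(1) if match else None
print(username)
```

Against this raw text the pattern matches and captures the username, which is consistent with the Regex-101 result described above.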

EDIT

Here is an updated version of the expression that does NOT allow a dash at the end of the username, but still allows one in the middle:

/\bTo:\s([a-z0-9][-a-z0-9]{0,18}[a-z0-9])\@somedomain\.com[a-z0-9\s=;:\/\.-]*\b\1\b/i
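A quick way to check what the updated expression accepts is to run it over a few small inputs. This is a rough sketch in Python's `re` module (assumed to behave like a Perl-compatible engine for these constructs). The helper `captured_username` and the three test strings are hypothetical and exist only for this illustration.

```python
import re

# Updated expression from the edit above; the trailing /i becomes re.IGNORECASE.
UPDATED = re.compile(
    r"\bTo:\s([a-z0-9][-a-z0-9]{0,18}[a-z0-9])\@somedomain\.com"
    r"[a-z0-9\s=;:\/\.-]*\b\1\b",
    re.IGNORECASE,
)

def captured_username(text):
    """Return the captured username, or None when the expression does not match."""
    m = UPDATED.search(text)
    return m.group(1) if m else None

# A dash in the middle of the username is accepted.
middle_dash = captured_username("To: some-user55@somedomain.com\nhello some-user55\n")

# A dash as the last character of the username is rejected.
trailing_dash = captured_username("To: someuser-@somedomain.com\nhello someuser-\n")

# The username appearing only as part of a longer word is rejected by the final \b.
part_of_word = captured_username("To: recipient@somedomain.com\nDear recipients\n")

print(middle_dash, trailing_dash, part_of_word)
```

Only the first input matches. The other two return None, one because of the trailing dash and one because of the word boundary after the backreference.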
dub stylee
  • Why are you tacking on the backref to the user name at the end (followed by a word boundary)? One thing, with that, you allow the dash `-` character at the end of the backref, but the dash is _not_ a word character, so the boundary would have to be preceding an actual word character. –  Jul 07 '15 at 22:35
  • The backref to the username is at the end followed by a word boundary so that, as in this example, `recipient` will need to be present in the text following the `recipient@somedomain.com` in order for the rule to match. The word boundary is so that if the username is part of another word, then not to match, such as `recipients`. – dub stylee Jul 07 '15 at 22:37
  • The `-` is allowed in the backref in order to handle e-mail addresses such as `some-user55@somedomain.com`. – dub stylee Jul 07 '15 at 22:39
  • Then you should not allow dash `-` as the last character in the user name, it will cause problems. You can overcome that with a conditional boundary `(?(?<=\w)\b)` if your engine supports conditionals. –  Jul 07 '15 at 22:42
  • Anyway, just peel off expressions from the start and end until you get some matches. Modify as needed, then add it back. –  Jul 07 '15 at 22:43
  • I have done that. I posted all variations of the expression that I tried, as well as the one that worked. The question is related to why the expressions with the `:` before the username do not work in SpamAssassin, but do work in Regex-101? – dub stylee Jul 07 '15 at 22:45
  • `To:\s([a-z0-9][-a-z0-9]{1,19})\@somedomain\.com[a-z0-9\s=;:/\.-]*\1\b` works on that sample, but I'd use `\s+` after the `To:` in case there is more than one whitespace character. –  Jul 07 '15 at 22:51
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/82650/discussion-between-dub-stylee-and-sln). – dub stylee Jul 07 '15 at 23:00

1 Answer


With assistance from @sln in chat, we came up with the following expression that matches the full rule as expected:

/To:\s+([a-z0-9][-a-z0-9]{1,18}[a-z0-9])\@somedomain\.com[\S\s]*?\1\b/i

That will match To: username@somedomain.com ... username, so it should, for the most part, match on any e-mail message that contains the recipient's username in the body of the message. In our case, many of the spam e-mails we receive will contain the username, such as:

Greetings username!  Blah Blah Blah spam message.

What ended up fixing it was replacing the [a-z0-9\s=;:\/\.-]* following the e-mail address with [\S\s]*? (a lazy match of any characters, including newlines).
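The difference between the two forms can be seen on a small input. The sketch below uses Python's `re` module (assumed to behave like a Perl-compatible engine for these constructs) and an invented message, `MESSAGE`, in which a `<` appears between the address and the username. It illustrates one way the narrower character class can fail. It does not claim to show the exact text SpamAssassin passed to the rule in this case.

```python
import re

USERNAME = r"To:\s+([a-z0-9][-a-z0-9]{1,18}[a-z0-9])\@somedomain\.com"

# Narrow character class between the address and the backreference.
RESTRICTED = re.compile(USERNAME + r"[a-z0-9\s=;:\/\.-]*\1\b", re.IGNORECASE)

# Lazy "any character, including newlines" between the two.
PERMISSIVE = re.compile(USERNAME + r"[\S\s]*?\1\b", re.IGNORECASE)

# Hypothetical message: the Message-ID value is made up. Its "<" is not in
# the narrow character class, so that class cannot skip past this line.
MESSAGE = (
    "To: username@somedomain.com\n"
    "Message-ID: <12345@example.net>\n"
    "\n"
    "Greetings username!  Blah Blah Blah spam message.\n"
)

restricted_hit = RESTRICTED.search(MESSAGE) is not None
permissive_hit = PERMISSIVE.search(MESSAGE) is not None
print(restricted_hit, permissive_hit)
```

On this input the narrow class stops at the first character it does not list and never reaches the username, while the lazy [\S\s]*? form does.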

dub stylee