
Is it possible to create a Regular Expression in Procmail to filter out a link containing certain words?

For example, I would like to filter all emails that have a hyperlink with the word "unsubscribe" in it (it may not be the only word though). This would move a lot of the newsletters sent to me into a sub-folder I can check now and again.

For example, I tried this:

.*<a.*unsubscribe.*</a>.*

But that would just match anything with an opening link tag, followed somewhere by the word unsubscribe (not necessarily inside the link), followed by a closing link tag (not necessarily the first one after the word). It won't restrict the match to the last opening hyperlink tag before the word unsubscribe and the first closing hyperlink tag directly after it.

I can't find any information on how to find the last occurrence of some HTML before a word, then the word, then the first occurrence of some HTML after the word, which I guess is what I need to do.

Laurence Cope

1 Answer


This isn't entirely precise, but probably close enough to what you want.

:0B
* <a([  ]+[^ > ]+)*[  ]+href="[^>"]*unsubscribe
unsub/

This looks for an HTML a element with an href attribute in double quotes containing unsubscribe within the body text (B flag). The optional group ([ ]+[^ > ]+)* allows for zero or more other attributes before the href.

As is conventional in Procmail, the whitespace inside [ ] and in [^ > ] should be a space and a tab, in any order. (The mobile device I'm using now won't let me easily enter a tab, so this is not copy/paste-proof.)

However, not all HTML is well-formed, not all href attributes are double-quoted, and not all HTML attachments are sent unencoded. In fact, the biggest practical flaw is that quoted-printable HTML is not handled correctly. A simple "80/20" solution would be to change = to =(3D)? in the regex. A hugely more complex problem is how to handle all possible QP variations (including an optional soft line break, an equals sign followed by a newline, anywhere in the text). The really correct solution would be to use a properly MIME-aware tool instead of, or from inside of, Procmail; this way, you could also handle base64-encoded HTML transparently.
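As a rough illustration of the MIME-aware route, here is a minimal sketch using Python's standard email package. The function name has_unsubscribe_link and the idea of running it as a separate helper are assumptions for illustration, not part of the recipe above. The pattern is the same one as in the recipe, with newlines also accepted as whitespace between attributes.

```python
import re
from email import message_from_bytes

# Same idea as the recipe's pattern. IGNORECASE mirrors Procmail's default
# case-insensitive matching; accepting \r and \n as whitespace is an addition.
LINK_RE = re.compile(
    r'<a([ \t\r\n]+[^ \t\r\n>]+)*[ \t\r\n]+href="[^>"]*unsubscribe',
    re.IGNORECASE,
)


def has_unsubscribe_link(raw_message: bytes) -> bool:
    """Hypothetical helper: True if any text/html part, once decoded,
    contains an <a href="...unsubscribe..."> link."""
    msg = message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        # decode=True undoes the Content-Transfer-Encoding
        # (quoted-printable or base64) and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            html = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8 with replacement.
            html = payload.decode("utf-8", errors="replace")
        if LINK_RE.search(html):
            return True
    return False
```

Because the transfer encoding is undone before matching, the =(3D)? workaround and the soft line break problem do not arise here. Hooking such a helper into Procmail (for example through a recipe condition that tests a program's exit code) is left out of this sketch.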

Superficially, your immediate question is answered by changing the repetition operators from greedy to non-greedy. In so many words, .* will skip as much text as possible, whereas [^>]* will stop, at the latest, just before the next occurrence of >. However, as noted above, there are significant additional complications because of how MIME allows for text to be encoded in different ways for safe transfer by email.
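To see the difference on a concrete input, the following sketch runs both patterns over two made-up snippets. It uses Python's re module as a stand-in for Procmail's own regex engine, so it only illustrates the matching logic; the sample HTML and variable names are invented for the example.

```python
import re

# The pattern from the question: .* can run across tag boundaries.
GREEDY = re.compile(r'<a.*unsubscribe.*</a>', re.IGNORECASE)

# The pattern from the recipe: [^>"]* cannot leave the href value,
# and the attribute group cannot run past the end of the tag.
BOUNDED = re.compile(
    r'<a([ \t]+[^ \t>]+)*[ \t]+href="[^>"]*unsubscribe',
    re.IGNORECASE,
)

# "unsubscribe" appears only in plain text between two unrelated links.
false_positive = (
    '<a href="http://example.com/news">News</a> '
    'To unsubscribe, reply to this message. '
    '<a href="http://example.com/about">About</a>'
)

# "unsubscribe" appears inside the href, after another attribute.
real_link = '<p><a class="x" href="http://example.com/unsubscribe?id=1">Leave</a></p>'

greedy_hits_false_positive = GREEDY.search(false_positive) is not None
bounded_hits_false_positive = BOUNDED.search(false_positive) is not None
bounded_hits_real_link = BOUNDED.search(real_link) is not None
```

The first snippet is matched by the question's pattern but not by the bounded one, while the second snippet is matched by the bounded pattern.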

tripleee
  • Great, thanks a lot. I will have a play around and get back to you. I never thought I would filter all emails that should match, but if it filters the majority, that's good enough! – Laurence Cope Sep 28 '13 at 14:13