3

I am parsing emails. Mobile Mail on the iPhone (and, I assume, the iPod touch) appends the signature as a separate boundary, which makes it simple to remove. Not all mail clients do this; many just use '--' as a signature delimiter.

I need to chop off the '--' from a string, but only the last occurrence of it.

Sample copy

 hello, this is some email copy-- check this out
 --
 Tom Foolery

I thought about splitting on '--' and removing the last part, but neither explode() nor split() seems to return a value that clearly tells me whether anything was split when there is no match.

I cannot get preg_replace() to match across more than one line. I have standardized all line endings to \n.

What is the best way to end up with hello, this is some email copy-- check this out, taking note that there will be cases where there is no signature, and that I of course cannot cover every case?

mickmackusa

6 Answers

8

Actually, the correct signature delimiter is "-- \n" (note the space before the newline), so the delimiter regexp should be '^-- $'. You might consider using '^--\s*$' instead, so that it also works with OE (Outlook Express), which gets it wrong.

vartec
  • I was unaware there was a standard for signature format. Can you cite? – John Saunders Apr 07 '09 at 12:26
  • Which would be http://tools.ietf.org/html/rfc3676#section-4.3. As the RFC states, it's more a widely accepted convention than a real standard. – Tomalak Apr 07 '09 at 12:29
  • good information but I highly doubt that you could expect it to be consistent. – Kibbee Apr 07 '09 at 13:24
  • @Kibbee: most mailers follow this RFC. Some (like e.g. OE) strip *all* trailing whitespace, '^--\s*$' works in both cases. – vartec Apr 07 '09 at 13:31
  • Apple Mail, for example, lets you make a sig. I put in '--', but at times forget to put in the '-- '. It certainly allows you to omit the '-- ' entirely if you so desire. Email is about the most amazing mess I have ever dealt with. –  Apr 07 '09 at 21:40
  • @scott: true, but then there's nothing that can be done about signatures that don't comply. – vartec Apr 08 '09 at 07:02
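As a rough illustration of the delimiter-line approach above, here is a minimal sketch. It assumes the pattern is applied with the m modifier (so that ^ and $ match at line boundaries) and that trailing whitespace before the signature should be trimmed; the function name is hypothetical and not from the answer.

```php
<?php
// Hypothetical helper: cut the body at the LAST line that consists only of
// "--" plus optional trailing whitespace.
function stripSignatureByDelimiterLine(string $body): string
{
    // m modifier: ^ and $ match at the start and end of each line.
    $count = preg_match_all('/^--\s*$/m', $body, $matches, PREG_OFFSET_CAPTURE);
    if (!$count) {
        // No delimiter line found (or regex error): leave the body untouched.
        return $body;
    }
    // Each entry is [matched text, byte offset]; take the last one.
    $last = end($matches[0]);
    // Keep everything before the delimiter and drop trailing whitespace.
    return rtrim(substr($body, 0, $last[1]));
}
```

Because the pattern is anchored to a whole line, a "--" in the middle of a sentence (as in the sample copy) is not treated as a delimiter.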
3

Try this:

preg_replace('/--[\r\n]+.*/s', '', $body)

This will remove everything from the first occurrence of -- that is followed by one or more line break characters. If you only want to remove from the last occurrence, capture the greedy prefix with /(.*)--[\r\n]+.*/s and use '$1' as the replacement.

Gumbo
  • Just to clarify: the final /s makes the regex treat the whole string as a [S]ingle line – Piskvor left the building Apr 07 '09 at 12:58
  • Thanks, can you elaborate on how either of those would target the *last* occurrence? What if there is a plain text part, and someone pushes in a -- in the middle of it, as well as a signature? I have been considering reversing the string and finding the first occurrence, then putting it back. –  Apr 07 '09 at 21:42
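To illustrate how a greedy prefix targets the last occurrence, here is a small sketch. It assumes a capture group is added around the leading .* so that the text before the delimiter can be put back; the function name is hypothetical.

```php
<?php
// Hypothetical helper: remove the last "--" + line break(s) and everything after it.
function stripAfterLastDelimiter(string $body): string
{
    // (.*) is greedy, so with the s modifier it first swallows the whole string
    // and then backtracks until "--" followed by line breaks matches. The first
    // position where that succeeds is the LAST delimiter in the string.
    $result = preg_replace('/(.*)--[\r\n]+.*/s', '$1', $body);

    // preg_replace() returns null on error; fall back to the original body.
    return $result ?? $body;
}
```

With a body that contains a "--" line in the middle as well as a signature, only the final delimiter and the text after it are removed. The line break before the delimiter is kept, so the result may need an rtrim().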
3

Instead of just chopping off everything after --, could you not cache the last few emails sent by that user or service and compare them? The bit at the bottom that looks like the others can be safely removed, leaving the proper message intact.

Tom
  • I have considered things like this. With Mobile Mail on the iPhone and iPod touch, Gmail, Outlook, and all the ways in which people move around these days, I figure there is no way to get a clear idea of which client they will be using at any given time. –  Apr 07 '09 at 21:37
3

I think, in the interest of being more bulletproof, I will take the non-regex route:

        echo substr($body, 0, strrpos($body, "\n--"));
  • If $body does not contain "\n--", the strrpos function will return false, which will cause substr to return an empty string. Wrap it in an if() statement first and check for "\n--". – Andy Feb 07 '12 at 07:32
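A guarded version of the non-regex route, along the lines the comment suggests, might look like the sketch below. The function name is hypothetical, and returning the body unchanged when no delimiter is found is an assumption about the desired behaviour.

```php
<?php
// Hypothetical helper: cut the body at the last "\n--", if there is one.
function stripLastSignature(string $body): string
{
    // strrpos() returns the offset of the last occurrence, or false if none.
    $pos = strrpos($body, "\n--");

    // Strict comparison: with false, substr($body, 0, false) would yield ''.
    if ($pos === false) {
        return $body;
    }

    // Keep everything before the newline that precedes the delimiter.
    return substr($body, 0, $pos);
}
```

The strict === false check matters because a loose check would also treat a match at offset 0 as "not found".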
2

This seems to give me the best result:

$body = preg_replace('/\s*(.+)\s*[\r\n]--\s+.*/s', '$1', $body);

  • It will match and trim the last "(newline)--(optional whitespace/newlines)(signature)"
  • Trim all remaining newlines before the signature
  • Trim beginning/ending whitespace from the body (remaining newlines before the signature, whitespace at the start of the body, etc)
  • Will only work if there's some text (non-whitespace) before the signature; otherwise the pattern does not match and the body, signature included, is returned intact
Kemal
0

To cleanly remove the whole signature along with its leading newline, match greedily up to the last occurring --. Just before matching that final newline and -- (followed by zero or more spaces and then a system-agnostic newline, \R), restart the full-string match with \K; then match the rest of the string so that it is replaced.

Code:

$string = <<<BODY
hello, this is some email copy-- check this out
--
Tom Foolery
BODY;

var_export(preg_replace('~.*\K\R-- *\R.*~s', '', $string));

Output:

'hello, this is some email copy-- check this out'
mickmackusa