
I operate an e-mail archive for a law firm. It receives mail from Postfix and uses a PHP script to insert messages into a database. This mostly works, but the regular expression I use to parse e-mail addresses from the From, To, and Cc headers sometimes fails to capture every address. I have tried the other solutions suggested here on Stack Overflow (using filter_var(), using imap_rfc822_parse_adrlist(), using the regex in question 1028553), with less success than what I already have.

I am looking to minimize function calls (I use way too many preg_* calls right now) and increase accuracy. The current function takes header text as input (the From, To, or Cc fields) and returns "clean" e-mail addresses stripped of brackets, quotes, comments, etc.

Any help anyone can provide would be appreciated, as I am stumped!

Wendy

My function:

function return_proper($email_string) {
    if (is_array($email_string)) {
        $x = "";
        foreach ($email_string as $val) {
            $x .= "$val,";
        }
        $email_string = substr($x, 0, -1);
    }

    $email_string = strtolower(preg_replace('/.*?([A-Za-z0-9\_\+\.\'-]+@[A-Za-z0-9\.-]+).*?/', '$1,', $email_string));
    $email_string = preg_replace('/\>/', "", $email_string);
    $email_string = preg_replace('/,$/', "", $email_string);
    $email_string = preg_replace('/^\'/', "", $email_string);
    return $email_string;
}
Matt
  • Do you have an example of some text that it fails on? – Eli Jan 11 '12 at 16:55
  • Yet another RegExp ... but it's a good one; take a look at those provided by Hexillion at: http://hexillion.com/samples/ - mix that with something like PHP's `getmxrr` function to ensure the domain has a valid MX record and you probably won't go far wrong. – CD001 Jan 11 '12 at 17:00
  • I've supplied some samples below...the original input string is on top followed by the results of the function. It *mostly* works, but I'd like to simplify into one preg_match_all and maybe catch some of these errors better. – Wendy Thompson Jan 11 '12 at 17:39
  • [11-Jan-2012 11:36:14] (Mail Delivery System) [11-Jan-2012 11:36:14] mailer-daemon@zcsmcmailsec01.ensue.com, (mail delivery system) (in this case the function doesn't strip the ", (mail delivery system)" text) – Wendy Thompson Jan 11 '12 at 17:41
  • I might be thinking about this a bit backwards, but what you want is something that extracts only the valid e-mail part from a given string... so why not allow wildcards at the start and end of the pattern but capture only the part that actually matches an address? Something like `/^.*(VALID_EMAIL_REGEXP_PART).*$/`, basically, so any invalid characters before or after the e-mail address are removed. It might need some work on PHP's greedy/ungreedy RegExp handling, but it _should_ work. – CD001 Jan 16 '12 at 16:26

1 Answer

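function return_proper($email_string) {
    if (is_array($email_string)) {
        // Process each header line recursively, then join the results.
        $results = array(); // initialise so an empty input array is safe
        foreach ($email_string as $email_string_line) {
            $results[] = return_proper($email_string_line);
        }

        // Skip lines that yielded no addresses to avoid stray commas.
        $result = implode(',', array_filter($results, 'strlen'));
    } else {
        // Collect every address-like token in a single preg call.
        preg_match_all('/[A-Za-z0-9\_\+\.\'-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+/', $email_string, $matches);

        $result = implode(',', $matches[0]);
    }

    return strtolower($result);
}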
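
As a quick check, here is a minimal sketch that runs the same pattern over a header resembling the Mail Delivery System case from the comments. The helper name extract_addresses() and the sample header are made up for illustration (the exact original input isn't shown in the question), and returning an array instead of a comma-joined string is just a convenience for testing.

```php
<?php
// Hypothetical helper: same pattern as return_proper() above,
// but returns the matches as an array of lowercased addresses.
function extract_addresses(string $header): array
{
    preg_match_all(
        '/[A-Za-z0-9_+.\'-]+@[A-Za-z0-9.-]+\.[A-Za-z]+/',
        $header,
        $matches
    );

    // $matches[0] holds the full text of every match, in order.
    return array_map('strtolower', $matches[0]);
}

// Made-up sample header with a trailing comment and a quoted display name.
$header = 'MAILER-DAEMON@zcsmcmailsec01.ensue.com (Mail Delivery System), '
        . '"Doe, Jane" <Jane.Doe@example.com>';

echo implode(',', extract_addresses($header)), "\n";
```

Because preg_match_all() only collects what matches, the "(Mail Delivery System)" comment, the quoted display name, and the angle brackets never make it into the result, so no clean-up passes are needed afterwards. Note that the pattern still allows an apostrophe in the local part, so an address wrapped in single quotes may keep its leading quote.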
Rannnn