How to accurately extract the email and unit number string text from OCR content?

Question

I've used Google Cloud Vision OCR to extract the text of business cards and tried to pull the email address out of that text with the regular expression below, but the results are poor. Are there any suggestions for improving the accuracy?

function extract_emails($str){
    // This regular expression extracts all emails from a string:
    $regexp = '/([a-z0-9_\.\-])+\@(([a-z0-9\-])+\.)+([a-z0-9]{2,4})+/i';
    preg_match_all($regexp, $str, $m);

    return isset($m[0]) ? $m[0] : array();
}

$Email = extract_emails($gcv_response);

if (!empty($Email))
{
    $Email = reset($Email); 
}
else
{
    $Email = 'NULL';
}

OCR text 1: "ALGEN MARINE PTE LTD Specialist in Fire Protection and Safety Engineering Philip Cheng Assistant Sales Manager 172 Tuas South Avenue 2, West Point Bizhub, Singapore 637191 Email: philip @algen.comsg Website: www.algen.comsg Tel: (65) 6898 2292 Fax: (65) 6898 2202 (65) 6898 2813 HP : (65) 9168 9799"

Result from running the above code: NULL. Desired output: philip@algen.comsg

OCR text 2: "Allan Lim Yee Chian Chief Executive Officer Alpha Biofuels (S) Pte Ltd LHCCBNFLN FR2 a mobile 9790 3063 tel 6264 6696 fax 6260 2082 C#01-05, 2 Tuas South Ave 2 Singapore 637601 tang. Steve. Eric@alphabiofuels.sg www.alphabiofuels.sg"

Result from running the above code: NULL. Desired output: tang.Steve.Eric@alphabiofuels.sg

What do you mean by _without much good results_? Are certain emails not matching? or is it too slow? — degant, May 08 '17 at 09:49
The results were inconsistent. For example, the OCR result "xxxxxxxxx ericmay. micheal@amd.com xxxx" came out as "micheal@amd.com" when the real address was "ericmay.micheal@amd.com". I could remove all the spaces in between, but that would not work very well. I'm rather new to regular expressions, so there may be flaws in my pattern. — Vivian, May 08 '17 at 09:54
Seems to work fine. If you need to capture multiple email addresses, add /g — RST, May 08 '17 at 09:55
@Vivian you should probably add examples of the OCR results you are getting and which ones aren't being extracted correctly, so people can suggest improvements to your regex. Include positive examples that should match and negative ones that shouldn't — degant, May 08 '17 at 09:57
I only want to capture a single email address from the OCR text of a business card — Vivian, May 08 '17 at 09:57
It is not possible like that. How is the system going to know whether it is a random word or part of the email address? If the text has something like `Emailaddress: xxxxxxxxx ericmay. micheal@amd.com xxxx Phone:` then you can use these boundaries to determine the email address string — RST, May 08 '17 at 10:06
Updated my question per your request; rather new to regular expressions, and staff have been searching to no avail. — Vivian, May 08 '17 at 10:11
A side note: what if I want to extract only the unit number (such as #03-26, #B4-47 or #01-05) from the address in the OCR text above, and then append it to another string with $result=$googleAddress.$unitNum;? What regular expression should I use for $unitNum? — Vivian, May 08 '17 at 10:18
Unfortunately there is no way to tell the system which words to include as part of your email address and which to exclude. For the 2 sample inputs that you have provided, the regex `([a-zA-Z_\.\-\s]){1,64}\@(([a-z0-9\-])+\.)+([a-z0-9]{2,})+` works. Demo: https://regex101.com/r/115aTD/1. Spaces can then be removed from the output to form the email address. Note that this regex allows no numbers in the first part of the email address. — degant, May 08 '17 at 11:16
@degant Hi, it worked, but not in all cases. By the way, did you read my previous comment about the other, similar question? I tried ^#\w{1,4}.+(\w{1,4})$ without much luck — Vivian, May 08 '17 at 13:27
You can share for what cases it isn't working but like I said before, email is a tough problem to solve using regex in your case. For the Unit Number you can use this: `#(\d{2}-\d{2})` — degant, May 08 '17 at 13:32
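
The two regexes suggested in the comments above can be tried out with a minimal sketch like the following. The function names `extract_email_loose` and `extract_unit_number` are hypothetical, the `/i` flag on the email pattern is an assumption, and the optional leading letter in the unit number pattern (to cover `#B4-47`) is an addition on top of the suggested `#(\d{2}-\d{2})`.

```php
<?php
// Sketch only: function names are hypothetical, not from the thread.

// Email: allow whitespace inside the local part (regex from the comment,
// with the /i flag added as an assumption), then strip the whitespace.
function extract_email_loose(string $text): ?string
{
    $pattern = '/([a-zA-Z_\.\-\s]){1,64}\@(([a-z0-9\-])+\.)+([a-z0-9]{2,})+/i';
    if (preg_match($pattern, $text, $m) !== 1) {
        return null; // no candidate address found
    }
    // Remove the spaces that OCR inserted inside the address.
    return preg_replace('/\s+/', '', $m[0]);
}

// Unit number such as #01-05. The optional letter and the 1-2 digit floor
// are assumptions so that #B4-47 also matches.
function extract_unit_number(string $text): ?string
{
    if (preg_match('/#[A-Za-z]?\d{1,2}-\d{2}/', $text, $m) !== 1) {
        return null;
    }
    return $m[0];
}
```

With the two OCR samples from the question this yields `philip@algen.comsg` and `tang.Steve.Eric@alphabiofuels.sg`, and `extract_unit_number()` returns `#01-05` for the second sample, which could then be appended as in `$result = $googleAddress . $unitNum;`. The limitation raised earlier still applies: for "xxxxxxxxx ericmay. micheal@amd.com xxxx" the loose pattern absorbs the neighbouring word and returns `xxxxxxxxxericmay.micheal@amd.com`.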

score 2 · Answer 1 · answered Jul 17 '18 at 12:18

The two problems you were facing were that you were not converting the input to lowercase, and that you had not covered the case of spaces occurring inside the address. I tried to cover both, but you will have to modify the code according to your requirements.

function extract_emails($str){
    // This regular expression extracts all emails from a string:
    // An optional "name. " prefix and an optional space before "@" tolerate OCR-inserted spaces.
    $regexp = '/(([a-z0-9_\-])+\.\\s?)?([a-z0-9_\.\-])+\\s?\@(([a-z0-9\-])+\.)+([a-z0-9]{2,4})+/i';
    //$regexp = '/(([a-zA-Z0-9_\-])+\.\\s?)?([a-zA-Z0-9_\.\-])+\\s?\@(([a-z0-9\-])+\.)+([a-z0-9]{2,4})+/i'; // for using uppercase letters.

    preg_match_all($regexp, strtolower($str), $m);

    return isset($m[0]) ? $m[0] : array();
}

$Email = extract_emails($gcv_response);

if (!empty($Email))
{
    $Email = reset($Email); 
}
else
{
    $Email = 'NULL';
}
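
As a variant on the same idea, here is a minimal sketch that accepts a space only directly after a "." in the local part or directly around the "@", and then strips whitespace from the match. The function name `extract_email_strict` and the pattern itself are assumptions for illustration, not the regex from this answer. Unlike the code above, it does not cap the last domain label at four characters and it keeps the original letter case.

```php
<?php
// Sketch only: extract_email_strict() is a hypothetical name and the
// pattern is an assumption, not taken from the answer.
function extract_email_strict(string $text): ?string
{
    // Zero or more "segment." pieces, each optionally followed by one space,
    // then the last local-part segment, "@" with optional spaces around it,
    // and a dotted domain whose final label has at least two characters.
    $pattern = '/(?:[a-z0-9_\-]+\.\s?)*[a-z0-9_\-]+\s?@\s?(?:[a-z0-9\-]+\.)+[a-z0-9]{2,}/i';
    if (preg_match($pattern, $text, $m) !== 1) {
        return null; // nothing that looks like an address
    }
    // Drop the OCR-inserted spaces from the matched address.
    return preg_replace('/\s+/', '', $m[0]);
}

$Email = extract_email_strict('xxxxxxxxx ericmay. micheal@amd.com xxxx');
// $Email is "ericmay.micheal@amd.com"
```

Because a space is tolerated only next to a dot or the "@", the unrelated word "xxxxxxxxx" is left out. A word that ends in a period immediately before the address (for example "Ltd. john@x.com") would still be absorbed, so the result is a best guess rather than a guarantee.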
