Teradata Query to fetch email id from string data

Question

Input:

Data storing in column is string within that email ids are present. Need to fetch email ids.There is no fixed pattern. Example :

Field 1 
test – test12 – test@gmail.com - test1@gmail.com    
abc - abc@gmail.com - abc1@gmail.com - abc2@gmail.com
cde  cde@gmail.com
Comment generated by:Please contact : efg@gmail.com for any queries

Output:

Field 1
test@gmail.com - test1@gmail.com 
abc@gmail.com - abc1@gmail.com - abc2@gmail.com
cde@gmail.com
efg@gmail.com

There might be better ways, but one thought is to split the string to rows using something like `STRTOK_SPLIT_TO_TABLE` (delimited by space), then keep the rows which have an @ symbol, then put those rows back into a string using perhaps conditional aggregation (assuming you don't have too many emails per row). — Isolated, Feb 24 '23 at 15:00

Andrew · Answer 1 · 2023-02-24T16:23:38.750

Here's a really ugly solution using strtok_split_to_table (splitting your string into rows based on the space delimiter, filtering out rows that don't contain "@", and then concatenating them back together using XML_AGG (more or less the same as @Isolated's comment).

create volatile table vt_dummy (col1 int) on commit preserve rows;
insert into vt_dummy values(1);

with cte as (
select 
    1 as theId,
    'abc - abc@gmail.com - abc1@gmail.com - abc2@gmail.com' as str
    from vt_dummy
    UNION ALL
select
    2,
    'test@gmail.com - test1@gmail.com'
    from vt_dummy
)

select
theid,
trim(TRIM(TRAILING '-' FROM (XMLAGG(theWord || ' - ' ORDER BY tok_num) (VARCHAR(500)))))
from (
select
*
from table (strtok_split_to_table (cte.theId,cte.str,' ')
returns (theId integer,tok_num integer, theWord varchar(1000))) as t
where
 theWord like '%@%'
) t
group by theid

score 0 · Answer 2 · answered Feb 24 '23 at 17:18

0

If there's a maximum number of emails a regex will do the job:

                   regexp_substr(field_1, '([^ ]+@[^ ]+\.[^ ]+)', 1, 1, 'i')      -- 1st match
|| coalesce(', ' ||regexp_substr(field_1, '([^ ]+@[^ ]+\.[^ ]+)', 1, 2, 'i'), '') -- 2nd match
|| coalesce(', ' ||regexp_substr(field_1, '([^ ]+@[^ ]+\.[^ ]+)', 1, 3, 'i'), '') -- 3rd match 
... repeat for additional emails

Of course this regex could be replaced by a better one (you can easily search for it) returning less false positives.

answered Feb 24 '23 at 17:18

dnoeth

59,503
4
39
56

Thanks @dnoeth.Can you help in understanding code.Its working fine but wanted to know how exactly its working? – Sak Feb 24 '23 at 17:46
The RegEx looks for a `aaaa@bbbb.ccc` pattern between spaces and the 4th parameter is the occurance. The Coalesce returns either the nth email or an empty string, – dnoeth Feb 24 '23 at 18:51
for fourth time pattern matching its throwing error : [Teradata][ODBC Teradata Driver][Teradata Database](-3798)A column or character expression is larger than the max size. SELECT Command Failed. – Sak Mar 03 '23 at 20:10
Oops, those functions return a huge varchar and adding #4 hits the 32000 char limit: casting each regexp_substr to a varchar(128 or similar) will solve this issue. – dnoeth Mar 04 '23 at 09:40
thanks @dnoeth.I have added the cast and its working.However as per logic I am getting some false positive where space is not present.Is there anything that I can add in logic to correct the below output as I am getting incorrect output for below scenarios. input : i) table created by test@gmail.com,test1@yahoo.com.This table can be dropped ii) DROP TABLE AFTER 2023-03-04:sample@gmail.com iii) FT1 –sample1@gmail.com-sample2@gmail.com output : i)test@gmail.com,test1@yahoo.com.This ii)2023-03-04:sample@gmail.com iii)–sample1@gmail.com-sample2@gmail.com – Sak Mar 04 '23 at 23:01
*could be replaced by a better one*, e.g. `(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b)` But this still fails for `yahoo.com.This`, would require a list of all possible top level domains https://www.iana.org/domains/root/db – dnoeth Mar 05 '23 at 10:19

Teradata Query to fetch email id from string data

2 Answers2