0

Input:

Data storing in column is string within that email ids are present. Need to fetch email ids.There is no fixed pattern. Example :

Field 1 
test – test12 – test@gmail.com - test1@gmail.com    
abc - abc@gmail.com - abc1@gmail.com - abc2@gmail.com
cde  cde@gmail.com
Comment generated by:Please contact : efg@gmail.com for any queries

Output:

Field 1
test@gmail.com - test1@gmail.com 
abc@gmail.com - abc1@gmail.com - abc2@gmail.com
cde@gmail.com
efg@gmail.com
Sak
  • 61
  • 6
  • 1
    There might be better ways, but one thought is to split the string to rows using something like `STRTOK_SPLIT_TO_TABLE` (delimited by space), then keep the rows which have an @ symbol, then put those rows back into a string using perhaps conditional aggregation (assuming you don't have too many emails per row). – Isolated Feb 24 '23 at 15:00

2 Answers2

0

Here's a really ugly solution using strtok_split_to_table (splitting your string into rows based on the space delimiter, filtering out rows that don't contain "@", and then concatenating them back together using XML_AGG (more or less the same as @Isolated's comment).

create volatile table vt_dummy (col1 int) on commit preserve rows;
insert into vt_dummy values(1);

with cte as (
select 
    1 as theId,
    'abc - abc@gmail.com - abc1@gmail.com - abc2@gmail.com' as str
    from vt_dummy
    UNION ALL
select
    2,
    'test@gmail.com - test1@gmail.com'
    from vt_dummy
)

select
theid,
trim(TRIM(TRAILING '-' FROM (XMLAGG(theWord || ' - ' ORDER BY tok_num) (VARCHAR(500)))))
from (
select
*
from table (strtok_split_to_table (cte.theId,cte.str,' ')
returns (theId integer,tok_num integer, theWord varchar(1000))) as t
where
 theWord like '%@%'
) t
group by theid
Andrew
  • 8,445
  • 3
  • 28
  • 46
0

If there's a maximum number of emails a regex will do the job:

                   regexp_substr(field_1, '([^ ]+@[^ ]+\.[^ ]+)', 1, 1, 'i')      -- 1st match
|| coalesce(', ' ||regexp_substr(field_1, '([^ ]+@[^ ]+\.[^ ]+)', 1, 2, 'i'), '') -- 2nd match
|| coalesce(', ' ||regexp_substr(field_1, '([^ ]+@[^ ]+\.[^ ]+)', 1, 3, 'i'), '') -- 3rd match 
... repeat for additional emails

Of course this regex could be replaced by a better one (you can easily search for it) returning less false positives.

dnoeth
  • 59,503
  • 4
  • 39
  • 56
  • Thanks @dnoeth.Can you help in understanding code.Its working fine but wanted to know how exactly its working? – Sak Feb 24 '23 at 17:46
  • The RegEx looks for a `aaaa@bbbb.ccc` pattern between spaces and the 4th parameter is the occurance. The Coalesce returns either the nth email or an empty string, – dnoeth Feb 24 '23 at 18:51
  • for fourth time pattern matching its throwing error : [Teradata][ODBC Teradata Driver][Teradata Database](-3798)A column or character expression is larger than the max size. SELECT Command Failed. – Sak Mar 03 '23 at 20:10
  • Oops, those functions return a huge varchar and adding #4 hits the 32000 char limit: casting each regexp_substr to a varchar(128 or similar) will solve this issue. – dnoeth Mar 04 '23 at 09:40
  • thanks @dnoeth.I have added the cast and its working.However as per logic I am getting some false positive where space is not present.Is there anything that I can add in logic to correct the below output as I am getting incorrect output for below scenarios. input : i) table created by test@gmail.com,test1@yahoo.com.This table can be dropped ii) DROP TABLE AFTER 2023-03-04:sample@gmail.com iii) FT1 –sample1@gmail.com-sample2@gmail.com output : i)test@gmail.com,test1@yahoo.com.This ii)2023-03-04:sample@gmail.com iii)–sample1@gmail.com-sample2@gmail.com – Sak Mar 04 '23 at 23:01
  • *could be replaced by a better one*, e.g. `(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b)` But this still fails for `yahoo.com.This`, would require a list of all possible top level domains https://www.iana.org/domains/root/db – dnoeth Mar 05 '23 at 10:19