0

I am trying to use regexp_substr to break up table data held in one cell to the individual fields.

the data is , delimited. individual cells can also contain , within quotes and finally some cells can be unpopulated

My sample logic is working for the first 2 requirements but i can't sort the third, please help!

the issue is b4 should be null but it is being returned as F.

SEL
'a, b, c,, F,"d, e, f", g, h' AS f1,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,1) AS b1,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,2) AS b2,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,3) AS b3,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,4) AS b4,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,5) AS b5,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,6) AS b6,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,7) AS b7,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,8) AS b8,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,9) AS b9,
RegExp_Substr(f1,'(".*?"|[^",\s]+)(?=,|$)',1,10) AS b10
;

Thanks

JF

WJF
  • 1
  • Current solution returning as follows b1: a b2: b b3: c b4: F etc should be : b1: a b2: b b3: c b4: b5: F etc – WJF Dec 18 '22 at 17:03

2 Answers2

0

Your input looks like CSV-data. If the number of columns is constant you can utilize the CSVLD table function:

WITH cte AS 
 (  -- base select here
   SELECT 'a, b, c,, F,"d, e, f", g, h'  as f1
   --FROM mytable
 )
SELECT *
FROM TABLE
 (
   CSVLD
    (cte.f1  -- input column
    ,','     -- delimiter character
    ,'"'     -- quote character
    )
   RETURNS
    (
      b1  VarChar(11) CHARACTER SET Unicode
     ,b2  VarChar(11) CHARACTER SET Unicode
     ,b3  VarChar(11) CHARACTER SET Unicode
     ,b4  VarChar(11) CHARACTER SET Unicode
     ,b5  VarChar(11) CHARACTER SET Unicode
     ,b6  VarChar(11) CHARACTER SET Unicode
     ,b7  VarChar(11) CHARACTER SET Unicode
     ,b8  VarChar(11) CHARACTER SET Unicode
    )
 ) AS t
;

If your input column is LATIN remove the CHARACTER SET Unicode frm the output columns.

dnoeth
  • 59,503
  • 4
  • 39
  • 56
  • thanks Dnoeth, i'll certainly take a look, but i don't think the function will handle the volume of data unfortunately. obviously the a,b,c etc i have used are just representative of the real data. in reality these are customer names, address lines, telephone numbers, balances etc. but as i say i will have a look and come back in the morning – WJF Dec 18 '22 at 18:12
  • also the reg expression has been bothering me for so long now, i need to figure it out! – WJF Dec 18 '22 at 18:16
  • Well, the CSVLD function is made for handling large volumes of CSV data and will be much cheaper than lots of RegExes. Plus I've never seen a CSV parsing regex which covers all possible variations, simply search SO :-) – dnoeth Dec 18 '22 at 22:22
  • Of course you are spot on Doneth, thanks again – WJF Dec 19 '22 at 07:32
0

This regex works for your sample case:

(?:,|^) ?(".*?"|[^,]*)

You'll need to use the second group of the match instead of the first.

dc-ddfe
  • 487
  • 1
  • 11
  • works perfectly thanks, only issue is that i don't have access to REGEXP_SUBSTR_GPL so can't specifiy the second group of the match!! there is always something!! I can add asimple case check to look after the leading , on all fields after the first. not eloquent but will do the job fine. – WJF Dec 19 '22 at 08:51
  • This will fail when the data starts with a comma `,a...`. And simply switching to `(?:^|,)` doesn't work, too. Afaik REGEXP_SUBSTR_GPL doesn't require access rights, check if it's installed in TD_SYSFNLIB. – dnoeth Dec 20 '22 at 11:47
  • @dnoeth Not sure what you're talking about, it doesn't fail with a leading comma. I just tested it at [regex101](https://regex101.com/r/MO2yX1/1) – dc-ddfe Dec 20 '22 at 21:00
  • @dc-ddfe Teradata's RegEx are based on PCRE, but sometimes they don't work as expected, just don't ask me why :-) – dnoeth Dec 20 '22 at 21:53
  • Oh, is it a problem specific to Teradata? My bad then, I don't use that technology. I'm just here for the regex – dc-ddfe Dec 20 '22 at 22:02