1

[update:] Accepted answer suggests, this can not be done with the python re library in one step. If you know otherwise, please comment.

I'm reverse-engineering a massive ETL pipeline, I'd like to extract the full data lineage from stored procedures and views.

I'm struggling with the following regexp.

import re

select_clause = "`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`) AS `birth_date`,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`) AS `death_date`,(case when (isnull(`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`)) and (`landing_boggle_replica`.`CUST`.`u_cust_type` <> 'E')) then timestampdiff(YEAR,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`),curdate()) else NULL end) AS `age_in_years`,nullif(`landing_boggle_replica`.`CUST`.`u_occupationCode`,'') AS `occupation_code`,nullif(`landing_boggle_replica`.`CUST`.`u_industryCode`,'') AS `industry_code`,((`landing_boggle_replica`.`CUST`.`u_intebank` = 'Y') or (`sso`.`u_mySecondaryCust` is not null)) AS `online_web_enabled`,(`landing_boggle_replica`.`CUST`.`u_telebank` = 'Y') AS `online_phone_enabled`,(`landing_boggle_replica`.`CUST`.`u_hasProBank` = 1) AS `has_pro_bank`"

# this captures every occurrence of the source fields, but not the target
okay_pattern = r"(?i)((`[a-z0-9_]+`\.`[a-z0-9_]+`)[ ,\)=<>]).*?"

# this captures the target too, but captures only the first input field
wrong_pattern = r"(?i)((((`[a-z0-9_]+`\.`[a-z0-9_]+`)[ ,\)=<>]).*?AS (`[a-z0-9_]+)`).*?)"

re.findall(okay_pattern, select_clause)
re.findall(wrong_pattern, select_clause)

TLDR: I'd like to capture

[aaa, bbb, XXX],
[eee, fff, ..., ooo, YYY],
[ppp, ZZZ]

from a string like

"...aaa....bbb...XXX....eee...fff...[many]...ooo... YYY...ppp...ZZZ...."

where a,b,e,f,h match one pattern, X,Y,Z match another, and the first pattern might occur up to ~20 times, before the second one appears, which always appears alone.

I'm open to solutions with the sqlglot, sql-metadata, or sqlparse libraries as well, it is just regex is better documented.

(Probably I'm code golfing, and I should do this in several steps, starting with splitting the string into individual expressions.)

Lorinc Nyitrai
  • 968
  • 1
  • 10
  • 27

2 Answers2

2

You may use this regex with 3 capture and 1 non-capture groups:

(\w+)\.+(\w+)(?:\.+(\w+))?

RegEx Demo

Code:

import re
s = '...aaa....bbb...XXX....eee...fff...YYY...hhh...ZZZ....'
print (re.findall(r'(\w+)\.+(\w+)(?:\.+(\w+))?', s))

Output:

[('aaa', 'bbb', 'XXX'), ('eee', 'fff', 'YYY'), ('hhh', 'ZZZ', '')]
anubhava
  • 761,203
  • 64
  • 569
  • 643
  • Nice and simple. How can I modify this regex to match any number of pattern_1 groups? In my example there might be up to ~20 occasions before a pattern_2 appears. – Lorinc Nyitrai Aug 29 '22 at 03:59
  • ok then how is pattern_1 defined? Does it always start with lowercase letters where `pattern_2` starts with upper case? – anubhava Aug 29 '22 at 04:29
  • 1
    Also please keep in mind that we cannot grab repeated captured group in python `re`. e.g. you can use this regex: `([a-z]\w*(?:\.+[a-z]\w*){0,19})(?:\.+(\w+))?` but there will be only 2 captured groups per match. – anubhava Aug 29 '22 at 04:44
  • 1
    In this case your initial proposal is as close, as I can get. Thank you. – Lorinc Nyitrai Aug 29 '22 at 04:47
0

Here's two regexes, one to group things by the outside pattern, and one for the inside:

(.*?)(XXX|YYY|ZZZ)
(aaa|bbb|ccc|ddd|eee|fff|ggg)

What I would suggest is matching the whole string with the first regex, and then using the second regex on the first regex's match (.*?)

By using these two regexes, your matches will be grouped first by the outer pattern and then by the inner pattern, but the regexes themselves doesn't have to be overly complicated.