capture pattern_X repeatedly, then capture pattern_Y once, then repeat until EOS

Question

[update:] Accepted answer suggests, this can not be done with the python re library in one step. If you know otherwise, please comment.

I'm reverse-engineering a massive ETL pipeline, I'd like to extract the full data lineage from stored procedures and views.

I'm struggling with the following regexp.

import re

select_clause = "`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`) AS `birth_date`,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`) AS `death_date`,(case when (isnull(`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_death_date`)) and (`landing_boggle_replica`.`CUST`.`u_cust_type` <> 'E')) then timestampdiff(YEAR,`data_staging`.`CONVERT_BOGGLE_DATE`(`landing_boggle_replica`.`CUST`.`u_birth_date`),curdate()) else NULL end) AS `age_in_years`,nullif(`landing_boggle_replica`.`CUST`.`u_occupationCode`,'') AS `occupation_code`,nullif(`landing_boggle_replica`.`CUST`.`u_industryCode`,'') AS `industry_code`,((`landing_boggle_replica`.`CUST`.`u_intebank` = 'Y') or (`sso`.`u_mySecondaryCust` is not null)) AS `online_web_enabled`,(`landing_boggle_replica`.`CUST`.`u_telebank` = 'Y') AS `online_phone_enabled`,(`landing_boggle_replica`.`CUST`.`u_hasProBank` = 1) AS `has_pro_bank`"

# this captures every occurrence of the source fields, but not the target
okay_pattern = r"(?i)((`[a-z0-9_]+`\.`[a-z0-9_]+`)[ ,\)=<>]).*?"

# this captures the target too, but captures only the first input field
wrong_pattern = r"(?i)((((`[a-z0-9_]+`\.`[a-z0-9_]+`)[ ,\)=<>]).*?AS (`[a-z0-9_]+)`).*?)"

re.findall(okay_pattern, select_clause)
re.findall(wrong_pattern, select_clause)

TLDR: I'd like to capture

[aaa, bbb, XXX],
[eee, fff, ..., ooo, YYY],
[ppp, ZZZ]

from a string like

"...aaa....bbb...XXX....eee...fff...[many]...ooo... YYY...ppp...ZZZ...."

where a,b,e,f,h match one pattern, X,Y,Z match another, and the first pattern might occur up to ~20 times, before the second one appears, which always appears alone.

I'm open to solutions with the sqlglot, sql-metadata, or sqlparse libraries as well, it is just regex is better documented.

(Probably I'm code golfing, and I should do this in several steps, starting with splitting the string into individual expressions.)

score 2 · Accepted Answer · answered Aug 29 '22 at 03:28

2

You may use this regex with 3 capture and 1 non-capture groups:

(\w+)\.+(\w+)(?:\.+(\w+))?

RegEx Demo

Code:

import re
s = '...aaa....bbb...XXX....eee...fff...YYY...hhh...ZZZ....'
print (re.findall(r'(\w+)\.+(\w+)(?:\.+(\w+))?', s))

Output:

[('aaa', 'bbb', 'XXX'), ('eee', 'fff', 'YYY'), ('hhh', 'ZZZ', '')]

answered Aug 29 '22 at 03:28

anubhava

761,203
64
569
643

Nice and simple. How can I modify this regex to match any number of pattern_1 groups? In my example there might be up to ~20 occasions before a pattern_2 appears. – Lorinc Nyitrai Aug 29 '22 at 03:59
ok then how is pattern_1 defined? Does it always start with lowercase letters where `pattern_2` starts with upper case? – anubhava Aug 29 '22 at 04:29
1

Also please keep in mind that we cannot grab repeated captured group in python `re`. e.g. you can use this regex: `([a-z]\w*(?:\.+[a-z]\w*){0,19})(?:\.+(\w+))?` but there will be only 2 captured groups per match. – anubhava Aug 29 '22 at 04:44
1

In this case your initial proposal is as close, as I can get. Thank you. – Lorinc Nyitrai Aug 29 '22 at 04:47

score 0 · Answer 2 · answered Aug 29 '22 at 02:22

Here's two regexes, one to group things by the outside pattern, and one for the inside:

(.*?)(XXX|YYY|ZZZ)
(aaa|bbb|ccc|ddd|eee|fff|ggg)

What I would suggest is matching the whole string with the first regex, and then using the second regex on the first regex's match (.*?)

By using these two regexes, your matches will be grouped first by the outer pattern and then by the inner pattern, but the regexes themselves doesn't have to be overly complicated.

capture pattern_X repeatedly, then capture pattern_Y once, then repeat until EOS

2 Answers2