Right now, I have a view with a mess of common, conditional string replacement and substitutions for an open text field - in this example, regional classification.
(Please ignore the accuracy of geography, I'm just working with historical standard assignments. Also, I know I could speed things up with REPLACE or even just cleaning the RegEx statements for lookback - I'm just asking about the variable/nesting here.)
CREATE OR REPLACE FUNCTION public.region_cleanup(record_region text)
RETURNS text
LANGUAGE sql
STRICT
AS $function$
SELECT REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(record_region,'(NORTH AMERICA\s\-\sUSA\s\-\sUSA)','USA')
,'Rest\sof\sthe\sWorld\s\-\s','')
,'NORTH\sAMERICA\s\-\sCANADA','NORTH AMERICA - Canada')
,'\&\;','&')
,'Georgia\s\-\sGeorgia','MIDDLE EAST - Georgia')
,'EUROPE - Turkey','MIDDLE EAST - Turkey')
A sample output using this function would look like this in my dataset, pulling out records impacted (some are already in the correct format):
record_region_input | record_region_output |
---|---|
NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro | USA - NORTHEAST - Massachusetts - Boston Metro |
NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia | USA - MIDATLANTIC - Virginia |
Rest of the World - ASIA - Thailand | ASIA - Thailand |
Rest of the World - EUROPE - Portugal | EUROPE - Portugal |
Rest of the World - ASIA - China - Shanghai Metro | ASIA - China - Shanghai Metro |
Georgia - Georgia | MIDDLE EAST - Georgia |
This is... fine. Regex is needed since there's tons of variability on what may come before or after these strings, and I have a proper validation list elsewhere. This is just a bulk scrub of common historical naming issues.
The problem is where I get hundreds of these kind of "known substitutions" (100+) for things like company naming or cross-department standards. Having dozens and dozens of REGEXP_REPLACE(
nested statements makes editing/adding/dropping anything a maddening game of counting.
I'm trying to clean data within Postgres exclusively, since my current pipeline doesn't always allow for standardization prior to upload. I know how I'd tackle this cleanly outside of pure SQL, but in a 'vanilla' PostgreSQL instance (v12+) is there a better method for transforming strings for a view?
Updated with a sample input/output table using the example function.