Context: i'm working on a table that contains 2 fields that may include geo-identifying data.
The fields can either contain NUTS codes or Postcodes (even partial). (https://en.wikipedia.org/wiki/Nomenclature_of_Territorial_Units_for_Statistics).
Now I have a second table that includes all (or most) possibles cases for these and i'm trying to match them.
If a match is made, save the NUTS1/2/3 into the corresponding fields.
Examples :
Country | First_Field | Second_Field | Output |
---|---|---|---|
FR | NULL | 75324 | Paris |
FR | FR101 | NULL | Paris |
FR | FRZZZ | NULL | Extra Region NUTS3 |
FR | NULL | 94 | Paris |
DE | NULL | 94 | Passau Kreisfreie Stadt |
DE | NULL | 936 | NULL |
Current version
-- Currently i'm updating an existing table, but i could create a copy and add the fields during creation
update src_table
SET geo_1 = COALESCE(tripleZPost.geo_1, directMatch.geo_1
,purgedPostFirstThree.geo_1, purgedPostFirstTwo.geo_1
)
,geo_2 = COALESCE(tripleZPost.geo_2, directMatch.geo_2
,purgedPostFirstThree.geo_2, purgedPostFirstTwo.geo_2
)
,geo_3 = COALESCE(tripleZPost.geo_3, directMatch.geo_3
,purgedPostFirstThree.geo_3, purgedPostFirstTwo.geo_3
)
FROM src_table as mainTable
-- Check if it's a generic NUTS code (FRZZZ, DEZZZ, ESZZZ)
LEFT JOIN geocodeIndex as tripleZPost on (
tripleZPost.country = mainTable.COUNTRY
and mainTable.FIRST_FIELD = mainTable.COUNTRY + 'ZZZ'
and mainTable.FIRST_FIELD = tripleZPost.geocode)
-- Check if we have a full match
LEFT JOIN geocodeIndex as directMatch on (
directMatch.country = mainTable.COUNTRY
and (
mainTable.FIRST_FIELD = directMatch.geocode
or
mainTable.SECOND_FIELD = directMatch.geocode)
)
-- If no full match is found, try a partial.
-- 75324 is a full postcode, we only keep the first 2-3 characters
LEFT JOIN geocodeIndex as purgedPostFirstTwo on (
purgedPostFirstTwo.country = mainTable.COUNTRY
and LEFT(SECOND_FIELD, 2) = purgedPostFirstTwo.geocode
)
LEFT JOIN geocodeIndex as purgedPostFirstThree on (
purgedPostFirstThree.country = mainTable.COUNTRY
and LEFT(SECOND_FIELD, 3) = purgedPostFirstThree.geocode
)
My inquiry now is two-fold :
- Is there a cleaner/more efficient way of doing multiple checks like these without using LEFT-JOIN's ?
- I've used coalesce to take the first non-NULL entry here to make a priority system, is there a way to run the LEFT-joins only if the previous entry didn't result in a match ?
Keep in mind, my table is around 800 million entries so while performance is a concern space/query runtime isn't, I can run it overnight, I just can't make my server crash.