0

Context: i'm working on a table that contains 2 fields that may include geo-identifying data.

The fields can either contain NUTS codes or Postcodes (even partial). (https://en.wikipedia.org/wiki/Nomenclature_of_Territorial_Units_for_Statistics).

Now I have a second table that includes all (or most) possibles cases for these and i'm trying to match them.

regionDataTable

If a match is made, save the NUTS1/2/3 into the corresponding fields.

Examples :

Country First_Field Second_Field Output
FR NULL 75324 Paris
FR FR101 NULL Paris
FR FRZZZ NULL Extra Region NUTS3
FR NULL 94 Paris
DE NULL 94 Passau Kreisfreie Stadt
DE NULL 936 NULL

Current version

-- Currently i'm updating an existing table, but i could create a copy and add the fields during creation
update src_table
SET geo_1 =  COALESCE(tripleZPost.geo_1, directMatch.geo_1
            ,purgedPostFirstThree.geo_1, purgedPostFirstTwo.geo_1
            )
    ,geo_2 =  COALESCE(tripleZPost.geo_2, directMatch.geo_2
            ,purgedPostFirstThree.geo_2, purgedPostFirstTwo.geo_2
            )
    ,geo_3 =  COALESCE(tripleZPost.geo_3, directMatch.geo_3
            ,purgedPostFirstThree.geo_3, purgedPostFirstTwo.geo_3
            )
FROM src_table as mainTable
-- Check if it's a generic NUTS code (FRZZZ, DEZZZ, ESZZZ)
LEFT JOIN geocodeIndex as tripleZPost on (
    tripleZPost.country = mainTable.COUNTRY
    and mainTable.FIRST_FIELD = mainTable.COUNTRY + 'ZZZ'
    and mainTable.FIRST_FIELD = tripleZPost.geocode)
-- Check if we have a full match
LEFT JOIN geocodeIndex as directMatch on (
    directMatch.country = mainTable.COUNTRY 
    and (
        mainTable.FIRST_FIELD = directMatch.geocode
        or
        mainTable.SECOND_FIELD = directMatch.geocode)
)
-- If no full match is found, try a partial.
-- 75324 is a full postcode, we only keep the first 2-3 characters
LEFT JOIN geocodeIndex as purgedPostFirstTwo on (
    purgedPostFirstTwo.country = mainTable.COUNTRY 
    and LEFT(SECOND_FIELD, 2) = purgedPostFirstTwo.geocode
)
LEFT JOIN geocodeIndex as purgedPostFirstThree on (
    purgedPostFirstThree.country = mainTable.COUNTRY 
    and LEFT(SECOND_FIELD, 3) = purgedPostFirstThree.geocode
)

My inquiry now is two-fold :

  • Is there a cleaner/more efficient way of doing multiple checks like these without using LEFT-JOIN's ?
  • I've used coalesce to take the first non-NULL entry here to make a priority system, is there a way to run the LEFT-joins only if the previous entry didn't result in a match ?

Keep in mind, my table is around 800 million entries so while performance is a concern space/query runtime isn't, I can run it overnight, I just can't make my server crash.

ratouney
  • 49
  • 2
  • 9
  • For a clause like `LEFT(SECOND_FIELD, 2) = purgedPostFirstTwo.geocode` a `LIKE` would be a SARGable solution: `SECOND_FIELD LIKE geocode + '%'` – Thom A May 09 '23 at 08:39
  • This is tricky. How big is the "nuts" tables? I'm not sure there's much better way than what you're doing, since you have so many rows. But it might be better to batch up the processing by doing parts of your large table, perhaps partitioned by some ID or similar. This way, you can resume your work easier. – siggemannen May 09 '23 at 11:22
  • the NUTS table is only 40k rows. And right now the only way to run the geo matching is indeed by segmenting it by country/year (FR2015-FR2017, FR2017-FR2020, etc...) I was doing it like this because I just needed it to work, but I was sure there would be a "cleaner" way of going at it... – ratouney May 09 '23 at 12:13

0 Answers0