0

I have two tables like this:

Summary Table

ID int
CityName Varchar(200)
PopulationCount int

Data Table

ID int
CityName Varchar(200)
PopulationCount int
Longitude float
Latitude float

The summary table has 800 rows and the DataTable has 800,000 rows. Each CityName in the Summary Table is in the Data Table at least once (and quite often multiple times). I need to create a new table to hold each of the 800 cities from the summary table and their corresponding Latitudes/Longitudes from the Data Table.

The problem is the CityName field in the summary table isn't an exact match to the CityName in the Data Table. There are slight differences.

   Sample Data: 
   Summary Table, CityName: Yarmouth (N.S.)
   Data Table, CityName: Yarmouth

The pattern in the sample data above doesn't always hold true i.e. the difference is not always a bracket so I dont see regex etc working. I imagine a contains() or like '% CityName %' could possibly work but I don't know how to implement them correctly.

Any help would be greatly appreciated...

pirho
  • 11,565
  • 12
  • 43
  • 70
Bagsy
  • 23
  • 5
  • 3
    Ok, so the condition to do the JOIN is one problem, but you also said that a City can have more than one record on the data table. If so, which Latitude and Longitude would you want to put on your summary table?, how do you choose one record over the other? – Lamak Jan 24 '12 at 15:38
  • What do you expect to happen if there are multiple *different* lat/longs in the data table for a given city? – Chris Shain Jan 24 '12 at 15:38
  • 2
    OK here is where you have to step back and look at your design which is inherently unworkable. For on thing look up how many cities named Columbus there are in differnt locations in the US. City is not a unique identifier even if you have the correct name. Next city name should be stored in one and only one table and cityid should be in all related tables. If you have set this up from the start with proper relationships and FKS you would not now have this problem. There is no fix for this but to fix your data model and store data in a way that it can successfully be queried. – HLGEM Jan 24 '12 at 15:48
  • Thanks all for the answers.... @Lamak - not interested which Lat/Lon comes back really... They're all in a general area and tha level of granularity is ok for my goal. HLGEM... you've got me. It seems I have at least 7 duplications in the Summary Table so like you said the city name is by no means unique. And for the record.. no i didnt set this up from the start. What I'm trying to get is a table with all cities in Canada with population > 5k with corresponding Lat/Lon. – Bagsy Jan 24 '12 at 15:59
  • contd... I have a table with every zipcode in Canada (>800k records!) but that is way too big for my purposes. I have a second table with 800 records of all cities > 5k in Canada so somehow I need to get Lat/Lon into that table. Back to the drawing board I think... Thanks for saving me some time though! – Bagsy Jan 24 '12 at 15:59
  • @AndriyM - yes its a one off task. I've massaged the data a little further so now I have a state code in each table. Using this and the city name all records in my summary table are now unique. Using a join where dataTable.City = summary.City and dataTable.StateCode = summary.stateCode I get about 70% match. Using a join to include dataTable.City like '%' summary.City '%' allows too many false positives... – Bagsy Jan 26 '12 at 09:27

2 Answers2

0

Using 'like' would look something like this:

select * 
from Summary s
     join Data d on (s.CityName like '%' + d.CityName + '%') or (d.CityName like '%' + s.CityName + '%')

But like you said, that may not capture everything

Dan Suceava
  • 157
  • 3
0

I am not sure if this would be the best approach, but given it's a one-off job, as you confirmed in a comment, I would definitely try the following:

  1. Create an intermediate table CityNames with columns like this:

    • StateCode (your state code),

    • CityNameVariant,

    • CityNameProper.

  2. Populate the first two columns with all the distinct combinations of StateCode and CityName found in both source tables.

  3. Using a series of manual updates, fill the CityNameProper column with names stripped off the various non-relevant parts like (N.S.) or , N.S. etc., so as to get a single uniform name matching its various versions, for every city. So the table would contain rows like this:

    StateCode  CityNameVariant  CityNameProper
    ---------  ---------------  --------------
    NS         Yarmouth (N.S.)  Yarmouth
    NS         Yarmouth         Yarmouth
    NS         Yarmouth, N.S.   Yarmouth
    

    Update statements might look like this:

    UPDATE CityNames
    SET CityNameProper = SUBSTRING(
      CityNameVariant,
      1,
      CHARINDEX('(', CityNameVariant)
    )
    WHERE CityNameVariant LIKE '% (%)'
    ;
    UPDATE CityNames
    SET CityNameProper = SUBSTRING(
      CityNameVariant,
      1,
      CHARINDEX(',', CityNameVariant)
    )
    WHERE CityNameVariant LIKE '%, %'
    ;
    UPDATE CityNames
    SET CityNameProper = SUBSTRING(CityNameVariant, …, …)
    WHERE CityNameVariant LIKE '…'
    ;
    
  4. The final query matching the two original tables would then look something like this:

    WITH joined AS (
      SELECT
        s.StateCode,
        s.CityName,
        d.PopulationCount,
        d.Longitude,
        d.Latitude,
        rnk = ROW_NUMBER() OVER (
          PARTITION BY s.StateCode, s.CityName
          ORDER BY d.Population  /* or maybe ‘BY Longitude, Latitude’
                                    or ‘BY (SELECT 1)’, it may not matter
                                    but in any event it's up to you */
        )
      FROM Summary s
        INNER JOIN CityNames c ON s.StateCode = c.StateCode
                              AND s.CityName  = c.CityNameVariant
        INNER JOIN Data d      ON d.StateCode = c.StateCode
                              AND d.CityName  = c.CotyNameProper
    )
    --INSERT INTO SummaryWithData
    SELECT
      StateCode,
      CityName,
      PopulationCount,
      Longitude,
      Latitude
    FROM joined
    WHERE rnk = 1
    

That is, match all the cities in Summary with those in Data and pick for each in the former one row from the latter.

Andriy M
  • 76,112
  • 17
  • 94
  • 154