Find "regional" relationships in SQL data using a query, or SSIS

Question

Edit for clarification: I am compiling data weekly, based on Zip_Code, but some Zip_Codes are redundant. I know I should be able to compile a small amount of data, and derive the redundant zip_codes if I can establish relationships.

I want to define a zip code's region by the unique set of items and values that appear in that zip code, in order to create a "Region Table"

I am looking to find relationships by zip code with certain data. Ultimately, I have tables which include similar values for many zip codes.

I have data similar to:

ItemCode   |Value  | Zip_Code
-----------|-------|-------
1          |10     | 1
2          |15     | 1
3          |5      | 1 
1          |10     | 2
2          |15     | 2
3          |5      | 2
1          |10     | 3
2          |10     | 3
3          |15     | 3

Or to simplify the idea, I could even concantenate ItemCode + Value into unique values:

ItemCode+
Value    | Zip_Code
A        | 1
B        | 1
C        | 1
A        | 2
B        | 2
C        | 2
A        | 3
D        | 3
E        | 3

As you can see, Zip_Code 1 and 2 have the same distinct ItemCode and Value. Zip_Code 3 however, has different values for certain ItemCodes.

I need to create a table that establishes a relationship between Zip_Codes that contain the same data.

The final table will look something like:

Zip_Code | Region
1        | 1
2        | 1
3        | 2
4        | 2
5        | 1
6        | 3
...etc

This will allow me to collect data only once for each unique Region, and derive the zip_code appropriately.

Things I'm doing now:

I am currently using a query similar to a join, and compares against Zip_Code using something along the lines of:

SELECT a.ItemCode
      ,a.value
      ,a.zip_code
      ,b.ItemCode
      ,b.value
      ,b.zip_code
FROM mytable as a, mytable as b -- select from table twice, similar to a join
WHERE a.zip_code = 1  -- left table will have all ItemCode and Value from zip 1
  AND b.zip_code = 2  -- right table will have all ItemCode and Value from zip 2
  AND a.ItemCode = b.ItemCode -- matches rows on ItemCode 
  AND a.Value != b.Value
ORDER BY ItemCode

This returns nothing if the two zip codes have exactly the same ItemNum, and Value, and returns a slew of differences between the two zip codes if there are differences.

This needs to move from a manual process to an automated process however, as I am now working with more than 100 zip_codes.

I do not have much programming experience in specific languages, so tools in SSIS are somewhat limited to me. I have some experience using the Fuzzy tools, and feel like there might be something in Fuzzy Grouping that might shine a light on apparent regions, but can't figure out how to set it up.

Does anyone have any suggestions? I have access to SQLServ and its related tools, and Visual Studio. I am trying to avoid writing a program to automate this, as my c# skills are relatively nooby, but will figure it out if necessary.

Sorry for being so verbose: This is my first Question, and the page I agreed to in order to ask a question suggested to explain in detail, and talk about what I've tried... Thanks in advance for any help I might receive.

While you did put a lot of detail into the question, the ultimate aim of what you are trying to achieve here is still pretty vague. What defines a region, and how do we know (given an Item Code and a Value, or just a Value) what Region a Zip Code is in? — Chris Shain, Mar 01 '12 at 17:37
Sorry about that. Ultimately, my goal is to analyze and consolidate the data. Assuming Zip_Code 1 and 2 have the same sets of ItemCode and Value, then Zip_Code 1 and 2 could be considered the same region. It is common within this data set to have the same ItemCodes with different Values, and this cannot be considered the same region. Each ItemCode will only appear once in each Zip_Code. The regions themselves are completely arbitrary, and more for the sake of grouping zip_codes together. — Mr. Tim, Mar 01 '12 at 20:33
So based on that logic, it seems that a zip code could be in multiple regions, right? Because "**It is common within this data set to have the same ItemCodes with different Values**", "**ItemCodes with different Values ... cannot be considered the same region**", and "**Each ItemCode will only appear once in each Zip_Code**" — Chris Shain, Mar 01 '12 at 21:32
Hmm. I'm sorry I'm having a hard time describing this... Let me try it this way: Currently I collect ItemCode and Value for 10 zip codes. I know that there are actually only 2-3 unique sets of ItemCode and Value contained within, but I do not know exactly which zip codes this relationship occurs in. I'd much rather only collect these 2-3 unique sets, one set of data for each "region" -- and derive the zip_code from this final table. The data is updated regularly, and it is intensive to create the data for every zip code. — Mr. Tim, Mar 01 '12 at 21:53
I think the big thing I didn't specify is that this is not static data, but is gathered weekly, where the Value will change from week to week. In my above example, Zip_Code 1 & 2 have the same data, so really I could grab only zip_code 1, and derive the data for zip_code 2 from this. Zip_Code 3 has different values for the same ItemCode, so needs to be compiled regardless. Before I consolidate, I need to find the relations between the zip_codes to create this final table. Thank you for attempting to look at this, despite the confusion. — Mr. Tim, Mar 01 '12 at 22:04
I think that what you are saying just dawned on me- you want to define a zip code's region by **the unique set of items and values that appear in that zip code**. I'll try to work up a solution based on that assumption. — Chris Shain, Mar 01 '12 at 22:09

score 1 · Accepted Answer · answered Mar 02 '12 at 16:09

Give this a shot (I used the simplified example, but this can easily be expanded). I think the real interesting part of this code is the recursive CTE...

;with matches as (
    --Find all pairs of zip_codes that have matching values.
    select d1.ZipCode zc1, d2.ZipCode zc2
    from data d1
        join data d2 on d1.Val=d2.Val
    group by d1.ZipCode, d2.ZipCode
    having count(*) = (select count(distinct Val) from data where zipcode = d1.Zipcode)
), cte as (
    --Trace each zip_code to it's "smallest" matching zip_code id.
    select zc1 tempRegionID, zc2 ZipCode
    from matches
    where zc1<=zc2
    UNION ALL
    select c.tempRegionID, m.zc2
    from cte c
        join matches m on c.ZipCode=m.zc1
                      and c.ZipCode!=m.zc2
    where m.zc1<=m.zc2
)
--For each zip_code, use it's smallest matching zip_code as it's region.
select zipCode, min(tempRegionID) as regionID
from cte
group by ZipCode

I am far from some ereeto bambito code slinger, so I appreciate the input I've had from the folks here. This example in particular was inspiring as I read through and tested the various stages. I thought it was very clever, and helped me learn. I was able to successfully integrate it into my bigger set of "test data" -- the next stage is to try it on the master set of data (32,000 records, 32 zip codes :D ). Thank you for sharing your sql knowledge with me, and everyone. — Mr. Tim, Mar 02 '12 at 21:40

score 1 · Answer 2 · answered Mar 02 '12 at 16:20

Demonstrating that there's a use for everything, though normally it makes me cringe: concatenate the values for each zip code into a single field. Store ZipCode and ConcatenatedValues in a lookup table (PK on the one, UQ on the other). Now you can assess which zip codes are in the same region by grouping on ConcatenatedValues.

Here's a simple function to concatenate text data:

CREATE TYPE dbo.List AS TABLE
(
Item    VARCHAR(1000)
)
GO

CREATE FUNCTION dbo.Implode (@List dbo.List READONLY, @Separator VARCHAR(10) = ',') RETURNS VARCHAR(MAX)
AS BEGIN
    DECLARE @Concat VARCHAR(MAX)
    SELECT @Concat = CASE WHEN Item IS NULL THEN @Concat ELSE COALESCE(@Concat + @Separator, '') + Item END FROM @List
    RETURN @Concat
END
GO

DECLARE @List AS dbo.List
INSERT INTO @List (Item) VALUES ('A'), ('B'), ('C'), ('D')
SELECT dbo.Implode(@List, ',')

Thank you for taking the time to look at my question. My noobiness is preventing me from expanding this into the actual data set, but it definitely helped me think about the data a little differently. I appreciate your input! — Mr. Tim, Mar 02 '12 at 21:37

Find "regional" relationships in SQL data using a query, or SSIS

2 Answers2