What is wrong with this join of a table to itself?

Question

I have a table called TempAllAddresses with the following columns - ID, Address, State. I want to populate a new table with Address, State, and Count. Count should represent how many records there are in the TempAllAddresses table that have an address like this address followed by a wildcard. If that made no sense, here's an example to illustrate - Let's say I have a record like this:

ID      Address      State
12345   13 Phoenix   NY

What I want to do is insert a new record into a new table called AddressCount that has 13 Phoenix for the Address, NY for the State, and the number of records in the table that have NY as the State and an address LIKE '13 Phoenix%' for the Count.

I want to accomplish this with an inner join of TempAllAddresses on itself. This is what I've tried, but it doesn't seem to accomplish what I'm looking for:

SELECT t1.Address, t1.State, COUNT(t2.address) As NumEntities
FROM TempAllAddresses t1
INNER JOIN TempAllAddresses t2
 ON t1.state = t2.state
 AND T2.Address LIKE t1.address + '%'
GROUP BY t1.State, t1.Address

The Count is definitely off, though. It should be equivalent to running "SELECT COUNT(*) FROM TempAllAddresses WHERE State=thisRecordsState and Address LIKE thisRecordsAddress + '%'". How can I accomplish this? What am I doing wrong?

Edit:

The count seems to be off in the following way - If I have a record like I mentioned above, and then I have 2 other records that also have a state of NY, and then have addresses of "13 Phoenix Road" and "13 Phoenix Rd", then I want to get in my final table a record like this:

13 Phoenix    NY    3

Instead, I seem to be getting:

13 Phoenix    NY    9

I'm not quite sure what's happening here... some sort of cartesian product? Permutations...? Can anyone explain this?

Edit 2: A further edit since I seem to be misunderstood (and really need a solution :( )... Here is a query with a correlated subselect that accomplishes what I'm looking for. I'd like to do the same thing with an inner join of the table on itself rather than a subselect.

SELECT Address, State, 
    (SELECT Count(*)
    FROM TempAllAddresses innerQry 
    WHERE innerQry.address LIKE outerQry.address + '%' 
        AND innerQry.state = outerQry.state) As NumEntities
FROM TempAllAddresses outerQry

Basically, for each record, I want to get the number of records in the table that have the same state and an address that begins with this address (or is equal to... I do want to include this address as part of the count).

How is the count off? Should each row count itself (all counts >= 1)? — Damien_The_Unbeliever, Dec 13 '10 at 09:41
it actually looks correct. if you don't summarize (i.e. group) is the result what you expect (for t2.address)? — lijie, Dec 13 '10 at 09:43
sorry, I'm a little fuzzy about this, so I'm not sure if I completely understand your comments... I'll try to update the question to explain how the count is off — froadie, Dec 13 '10 at 09:46
I've created a temp table, put the three rows you've so far mentioned (Phoenix, Phoenix Rd, Phoenix Road) in, and run your original query - and I get COUNTS of (3, 1, 1), respectively, which I believe are correct. — Damien_The_Unbeliever, Dec 13 '10 at 10:35
You should provide DDL for [TempAllAddresses] as well as a minimum list of rows (15? ¿20?) so that others can repeat your tests and see exactly what you see - with that sample data, you could say "I'm getting this and this using query A, this and this using query B - none are right, I need to get this results instead". — Joe Pineda, Dec 28 '10 at 16:35
Shouldn't city be part of this? You might find 10 Main Street in hundreds of cities in a large state. — HLGEM, Dec 31 '10 at 15:59

score 1 · Answer 1 · answered Dec 31 '10 at 22:18

Here's two solutions, one using a CROSS APPLY and the other using an INNER JOIN like you wanted originally. I hope this helps. :)

DECLARE @TempAllAddresses TABLE
(
    ID INT PRIMARY KEY IDENTITY(1, 1) NOT NULL
    , [Address] VARCHAR(250) NOT NULL
    , [State] CHAR(2) NOT NULL
)

INSERT INTO @TempAllAddresses
VALUES ('13 Phoenix', 'NY')
        , ('13 Phoenix St', 'NY')
        , ('13 Phoenix Street', 'NY')
        , ('1845 Test', 'TN')
        , ('1337 Street', 'WA')
        , ('1845 T', 'TN')

SELECT
    TempAddresses.ID
    , TempAddresses.[Address]
    , TempAddresses.[State]
    , TempAddressesCounted.AddressCount
FROM @TempAllAddresses TempAddresses
CROSS APPLY
(
    SELECT
        COUNT(*) AS AddressCount
    FROM @TempAllAddresses TempAddressesApply
    WHERE TempAddressesApply.[Address] LIKE (TempAddresses.[Address] + '%')
        AND TempAddressesApply.[State] = TempAddresses.[State]
) TempAddressesCounted

SELECT
    TempAddresses.ID
    , TempAddresses.[Address]
    , TempAddresses.[State]
    , COUNT(*) AS AddressCount
FROM @TempAllAddresses TempAddresses
INNER JOIN @TempAllAddresses TempAddressesJoin
    ON TempAddressesJoin.[Address] LIKE (TempAddresses.[Address] + '%')
            AND TempAddressesJoin.[State] = TempAddresses.[State]
GROUP BY TempAddresses.ID
    , TempAddresses.[Address]
    , TempAddresses.[State]

Guillem Vicens · Answer 2 · 2010-12-13T10:11:51.420

0

Try this instead:

SELECT Orig_Address, State, COUNT(Similar_Address)
From
(
  SELECT t1.Address Orig_Address, 
         t1.State   State, 
         t2.address Similar_Address
    FROM TempAllAddresses t1
   INNER JOIN TempAllAddresses t2
      ON t1.state = t2.state
     AND T2.Address LIKE t1.address + '%'
     AND t1.address <> t2.address
)
GROUP BY State, Orig_Address

EDIT: forgot to include the difference between t1.address and t2.address, as @Spiny Norman said, since you probably do not want to compare an address to itself.

HTH

edited Dec 13 '10 at 10:11

answered Dec 13 '10 at 09:40

Guillem Vicens

3,936
30
44

and how is this different from grouping by t1.address in the first place?? – lijie Dec 13 '10 at 09:45
@lijie, sorry, you are right. I wrote that in the first place, then realized it was wrong and just wanted to put my select proposal, but forgot to delete that part of the text. I'll edit it accordingly. – Guillem Vicens Dec 13 '10 at 09:56

Spiny Norman · Answer 3 · 2010-12-13T10:34:51.723

0

EDIT: [snip old stuff]

Try this:

SELECT t1.Address, t1.State, COUNT(distinct t2.id) As NumEntities
FROM TempAllAddresses t1
INNER JOIN TempAllAddresses t2
 ON t1.state = t2.state
 AND T2.Address LIKE t1.address + '%'
GROUP BY t1.State, t1.Address

edited Dec 13 '10 at 10:34

answered Dec 13 '10 at 09:57

Spiny Norman

8,277
1
30
55

Trying your first suggestion, I get 0 for NumEntities for all rows... I'd assume this is because t1.Address <> t1.Address is always false. Is there a typo here somewhere? – froadie Dec 13 '10 at 10:06
Yep :) I mean t1.Address <> t2.address. Edited. (by the way, I think the '_%' solution is more elegant, if your database server supports it). – Spiny Norman Dec 13 '10 at 10:08
This doesn't seem to be giving me the correct results at all... I'm checking some random results and they're totally off, but not in any noticeable pattern. Maybe it has something to do with the left join? I'm not quite sure why you think this is a solution to the problem so I can't really figure out how to tweak it... – froadie Dec 13 '10 at 10:18
1

Hey, what seems to work for me is to use your original query, but use `count(distinct t2.id)` instead of `count(t2.address)`. – Spiny Norman Dec 13 '10 at 10:33
1

No, `t2.id`. This way, all matching rows will be counted (once). It doesn't matter much, it depends if you want to count separate addresses that sound the same (same street name in a different town in the state, I guess) as one address or not. – Spiny Norman Dec 13 '10 at 10:51

score 0 · Answer 4 · answered Dec 13 '10 at 10:55

0

QUERY A:

SELECT t1.Address, t1.State, COUNT(t2.address) As NumEntities
FROM TempAllAddresses t1
INNER JOIN TempAllAddresses t2
 ON t1.state = t2.state
 AND T2.Address LIKE t1.address + '%'
GROUP BY t1.State, t1.Address

is not equivalent to

QUERY B:

SELECT Address, State, 
    (SELECT Count(*)
    FROM TempAllAddresses innerQry 
    WHERE innerQry.address LIKE outerQry.address + '%' 
        AND innerQry.state = outerQry.state) As NumEntities
FROM TempAllAddresses outerQry

because B produces 1 row for each row in the original table (TempAllAddresses), whereas A will group together rows in the original table that have the same state and address. To solve this, GROUP BY t1.ID, t1.State, t1.Address instead.

answered Dec 13 '10 at 10:55

lijie

4,811
22
26

@lijie agreed, but I don't think even the subselect version really does what froadie wants despite what he says - I think it is more likely he wants 1 row for each distinct address, state – Dec 14 '10 at 10:39
@jackpdouglas: yes i would think that that is the actual intention (but that's still doing mind-reading until OP confirms). for that purpose, would just adding `DISTINCT` after the `SELECT` keyword suffice (I don't have enough experience to know, but that's what I would naturally try, if I didn't want subqueries). – lijie Dec 14 '10 at 10:51
@lijie that would get the 'right' number of rows but the `count` would not work - see my answer for what I would do instead – Dec 14 '10 at 13:05
@jack: what would happen to count? does double counting occur? but that would mean distinct is treated as a group? – lijie Dec 14 '10 at 16:42
@lijie as far as I know you cannot use `distinct` and an aggregate like `count` in the same `select` on any RDBMS. You can nest of course, which is what I do in my answer (and you can do things like `count(distinct...` but I don't think that is what you mean) – Dec 14 '10 at 16:55
@jack: ah! ok i learnt something new today. but the limitation seems quite weird, i must say. is it a practical limitation? or a language (sql) limitation? – lijie Dec 14 '10 at 17:08
@lijie I think I misunderstood you first question - looking back I think you must have meant adding a `distinct` to froadie's query with the subselect rather than yours like I thought. In which case I think you were right - that would return what I think froadie really wants. Sorry. – Dec 14 '10 at 17:42
@jack: ah ok. so i guess it is fair to say that in cases where group by can return multiple rows (because we group by more fields than projected), distinct can still be used to eliminate duplicates? – lijie Dec 14 '10 at 17:46

score 0 · Answer 5 · answered Dec 13 '10 at 13:50

There is double counting going on when there are multiple rows with exactly the same address.

Try:

SELECT t1.Address, t1.State, COUNT(t2.address) As NumEntities
FROM (select distinct Address, State from TempAllAddresses) t1
INNER JOIN TempAllAddresses t2
 ON t1.state = t2.state
 AND T2.Address LIKE t1.address + '%'
GROUP BY t1.State, t1.Address

score 0 · Answer 6 · answered Dec 14 '10 at 10:11

Nested GroupBy:

The subquery will find the shortest address for each distinct address.
This doesn't consider case sensitivity.
Then each version of these addresses are counted.

SQL:

SELECT Address, State, count(1) As NumEntities
FROM ( 
    SELECT min(t1.Address) as Address, t1.State
    FROM TempAllAddresses t1
    INNER JOIN TempAllAddresses t2
     ON t1.state = t2.state
     AND T2.Address LIKE t1.address + '%'
    GROUP BY t1.State, t2.Address
) GROUP By State, Address

score 0 · Answer 7 · answered Dec 31 '10 at 15:39

0

Have you tried analytical functions - they are often the easiest solution. I am not familiar with your table structure, but it should be something like this:

SELECT t1.Address, t1.State, 
COUNT(t2.address) OVER (PARTITION BY t2.state) As NumEntities
FROM TempAllAddresses t1
INNER JOIN TempAllAddresses t2
ON t1.state = t2.state
AND T2.Address LIKE t1.address + '%'
GROUP BY t1.State, t1.Address

You can even add ORDER BY in the OVER clause. See Oracle FAQs for some explanation.

answered Dec 31 '10 at 15:39

RapidCoder

73
1
4

I haven't worked with Oracle before so I can't say if this will work there but in SQL Server this is invalid because t2.address and t2.state are not contained in the GROUP BY clause. – Jeremy Pridemore Dec 31 '10 at 22:27
@Jeremy: SQL Server is a completly different thing. This part of SQL is Oracle specific. – RapidCoder Jan 01 '11 at 16:24

What is wrong with this join of a table to itself?

7 Answers7