0

I am trying to create a new ID for a dataset which will identify and group people together when certain conditions are met, for example a shared address. I'm trying to add an autoincrement column that creates this ID for those people, so I could see person x and y have the same address so they get the same new_id, but person z may not share an address with anyone and so would have the new_id field left blank.

I think my syntax is wrong and am not sure that you can do ALTER TABLE [] ADD [] AS SELECT [] WHERE []. Can someone tell me where I'm going wrong here or how to do this properly? Sample code below. Apologies if the code formatting is bad, stackoverflow is ignoring my line breaks.

ALTER TABLE people_tab ADD new_id int autoincrement(1,1)
AS SELECT *
    FROM people_tab a
    INNER JOIN people_tab b
    ON LEFT(a.zip,5) = LEFT(b.zip,5) and a.id != b.id
        where LEN(a.address) >=10 and and LEN(b.address) >=10
Alexander Petrov
  • 13,457
  • 2
  • 20
  • 49
Alexis
  • 19
  • 6
  • Yes, you cannot do ```ALTER TABLE [] ADD [] AS SELECT [] WHERE [].``` – Pankaj Jul 06 '22 at 23:39
  • I don't mind doing it in steps, if that's what it takes. But I'm not sure how I would do that. I can't add the new column as an autoincrement, because it would populate for all rows, not just the ones that meet the criteria. Any advice? – Alexis Jul 06 '22 at 23:56
  • 1
    Consider instead adding a second table which represents addresses. Then it's a simple auto-incrementing id for the address. Join each person to their address. – Schwern Jul 07 '22 at 00:27

1 Answers1

0

Since addresses are shared, and not people, you should create a new table of addresses and have the auto-increment applied there for each unique address.

The quality of your address sources will play heavily into this - if you are using an address lookup service to improve your data quality, then you're headed in the right direction. Otherwise, you may be heading into a game of whack-a-mole correcting everything from misspellings, case insensitivity, variations on standard names (like RD, Road, Rd.) that mean the same thing, and others.

Assuming your data is "clean", then I would create a table with address and zip columns:

create or replace table address_master 
(id int autoincrement(1,1), 
 zip string, 
 address string );
    
insert into address_master (zip, address) 
select distinct zip, address from people_tab;

Now you can join your people_tab on address_master and get the id associated with all people matching the same address.

Jim Demitriou
  • 593
  • 4
  • 8