SQL Server Removing Duplicate Rows & Managing Keys

Question

I have two tables as shown below.

The Transaction Table is a table representing a sales event between a Buyer and a Supplier. The ... in the table represents other columns such as TransactionDate, Store Information, etc.

The Contact Table is a table representing the Buyer and a Supplier's information such as name & home addresss etc.

Right now the two tables share a one-to-one relationship where BuyerID = ContactID and SellerID = ContactID.

I'm populating this table with a python script so the answer could be in python, SQL, or both.

Current State

Transaction Table

TransactionID (pk)	BuyerID	SellerID
11914	11914 Buyer	11914 Seller
11915	11915 Buyer	11915 Seller
11916	11916 Buyer	11916 Seller

Contact Table

ContactID (pk)	FirstName	LastName
11914 Buyer	Mike	Doe
11914 Seller	Janet	Myers
11915 Buyer	Janet	Myers
11915 Seller	Mike	Doe
11916 Buyer	Dave	Dirk
11916 Seller	Janet	Myers

What I'd like to do is remove the repetitive rows in the Contact Table and make the table relationships a one-to-many. Additionally if for example Janet Myers changes her last name, there would be be a new addition to the Contact Table despite all of the other columns staying the same. i.e. ignoring the ContactID column, there shouldn't be two identical rows in the ContactTable.

Future State

Transaction Table

TransactionID (pk)	BuyerID	SellerID
11914	06dfe636-4408-4abe-b902-b1ac6aab9849	58e5c2f7-d0ef-4129-a5a8-e17efaee0a1d
11915	58e5c2f7-d0ef-4129-a5a8-e17efaee0a1d	06dfe636-4408-4abe-b902-b1ac6aab9849
11916	ab99de1f-a714-4709-9a1e-f3a143b279d2	58e5c2f7-d0ef-4129-a5a8-e17efaee0a1d

Contact Table

ContactID (pk)	FirstName	LastName
06dfe636-4408-4abe-b902-b1ac6aab9849	Mike	Doe
58e5c2f7-d0ef-4129-a5a8-e17efaee0a1d	Janet	Myers
ab99de1f-a714-4709-9a1e-f3a143b279d2	Dave	Dirk

My first thought to solve this would be to create the ContactID out of a hash of all the other columns.

sha256(json.dumps(contactDict.encrypt('utf-8'))).hexdigest()

Then I'd use a MERGE statement to add the contact if it the ContactID hash didn't already exist in the table. I think this would work, however I'm having second thoughts risking a hash collision with financial data.

Other than a hash, my other thought would be a MERGE statement but with around 15 columns in the ON clause I thought that'd be very low performance.

is the last row of your Contact table being Janet a mistake/typo or an example of your data quality? (ie, is [number] expected to uniquely identify an individual and the suffix just calling out repetition?) — TZHX, Aug 31 '21 at 14:24
Janet was involved with every transaction shown, either as a Buyer or Seller. That is why Janet appears multiple times in the Contacts Table. My Goal is to only have her in the Contacts Table once, and each transaction reference the same one Janet entry. — M. Stolte, Aug 31 '21 at 14:43

SQL Server Removing Duplicate Rows & Managing Keys

0 Answers0