I have two tables as shown below.
The Transaction Table is a table representing a sales event between a Buyer and a Supplier. The ... in the table represents other columns such as TransactionDate, Store Information, etc.
The Contact Table is a table representing the Buyer and a Supplier's information such as name & home addresss etc.
Right now the two tables share a one-to-one relationship where BuyerID = ContactID and SellerID = ContactID.
I'm populating this table with a python script so the answer could be in python, SQL, or both.
Current State
Transaction Table
TransactionID (pk) | ... | BuyerID | SellerID |
---|---|---|---|
11914 | 11914 Buyer | 11914 Seller | |
11915 | 11915 Buyer | 11915 Seller | |
11916 | 11916 Buyer | 11916 Seller |
Contact Table
ContactID (pk) | ... | FirstName | LastName |
---|---|---|---|
11914 Buyer | Mike | Doe | |
11914 Seller | Janet | Myers | |
11915 Buyer | Janet | Myers | |
11915 Seller | Mike | Doe | |
11916 Buyer | Dave | Dirk | |
11916 Seller | Janet | Myers |
What I'd like to do is remove the repetitive rows in the Contact Table and make the table relationships a one-to-many. Additionally if for example Janet Myers changes her last name, there would be be a new addition to the Contact Table despite all of the other columns staying the same. i.e. ignoring the ContactID column, there shouldn't be two identical rows in the ContactTable.
Future State
Transaction Table
TransactionID (pk) | ... | BuyerID | SellerID |
---|---|---|---|
11914 | 06dfe636-4408-4abe-b902-b1ac6aab9849 | 58e5c2f7-d0ef-4129-a5a8-e17efaee0a1d | |
11915 | 58e5c2f7-d0ef-4129-a5a8-e17efaee0a1d | 06dfe636-4408-4abe-b902-b1ac6aab9849 | |
11916 | ab99de1f-a714-4709-9a1e-f3a143b279d2 | 58e5c2f7-d0ef-4129-a5a8-e17efaee0a1d |
Contact Table
ContactID (pk) | ... | FirstName | LastName |
---|---|---|---|
06dfe636-4408-4abe-b902-b1ac6aab9849 | Mike | Doe | |
58e5c2f7-d0ef-4129-a5a8-e17efaee0a1d | Janet | Myers | |
ab99de1f-a714-4709-9a1e-f3a143b279d2 | Dave | Dirk |
My first thought to solve this would be to create the ContactID out of a hash of all the other columns.
sha256(json.dumps(contactDict.encrypt('utf-8'))).hexdigest()
Then I'd use a MERGE
statement to add the contact if it the ContactID hash didn't already exist in the table. I think this would work, however I'm having second thoughts risking a hash collision with financial data.
Other than a hash, my other thought would be a MERGE
statement but with around 15 columns in the ON
clause I thought that'd be very low performance.