0

I have two data sources that are loaded into Azure Synapse. Both raw data sources contain an 'Apple' table.

I merge these into a single 'Apple' table in my Enriched data store.

SELECT * FROM datasource1.apple JOIN datasource2.apple on datasource1.apple.id = datasource2.apple.id

However, both data sourecs also contain a one to many relation AppleColours.

Please could someone help me understand the correct approach to creating a single AppleColours table in my enriched zone?

atoms
  • 2,993
  • 2
  • 22
  • 43
  • data sources means you are referring to dedicated sql pool? – Pratik Lad Nov 18 '22 at 04:01
  • raw data is loaded into Azure Gene 2 srotage as parquet files. These need to be merged for the Enriched store. I'm not set on what data store to use as the enriched store – atoms Nov 18 '22 at 08:30

3 Answers3

0

You need data from both sources when you want to merge them. JOIN(INNER JOIN) will bring only the apple.id that is in both datasource1 and datasource2. You should try a CROSS JOIN

For the AppleColours 1 to many relation there are 2 methods:

  1. You could put direct the color in the Apple table in this case there is no need for a separate AppleColours table Apple ID| Color 1 | red 2 | green

To get data into Color column make another JOIN this time with the AppleColours on the colorID from Apple table and AppleColours

  1. Create a separate table AppleColours with ID and color. IN this table import both AppleColours tables from both datasources using a CROSS JOIN Add a column in Apple table named AppleColorId which has the id's from AppleColours

If you want an Applet table that has all the data and don't need any join's to determine the apple color use method 1. If you want a 'slim' apple table which has minimal data inside use method 2. In this case to get the apple color you have to make an extra JOIN(INNER JOIN) to the AppleColour table

Musti
  • 1
  • 1
  • Thanks for your answer. So if im correct for method 2 it would result in the destination having a single table of 'AppleColours', with columns: `id, applecolours_id_source_1, applecolours_id_source_2, colour_value`. Then when I populate an apple record I just have to lookup the id in the AppleColours table using the respective _soruce_x column? – atoms Nov 28 '22 at 16:06
  • Hi, you should not do 2 tables for colors, because as Joseba Machado say in the answer above, you could end up with the same color but different ColorId's.( one ColorId from datasource1 and another from datasource2). But I don't think using JOIN is the best solution, JOIN will get only the common Id's from both sources. What will happen with datasource1.apple.id that doesn't have a correspondent in datasource2.apple.id . I will add the suggested solution in the next comment. – Musti Dec 02 '22 at 10:24
  • I suggest doing a UNION between both 'Apple' table from datasource1 and datasource2 ( this will bring all data) and after a UNION between both 'AppleColor' table form both sources( you get all colors from both sources): – Musti Dec 02 '22 at 10:26
  • SELECT MergedSources.Apple,AppleColours.Colour FROM (SELECT datasource1.AppleName AS Apple, datasource1.AppleColours.ColourId AS ColourId, datasource1.AppleColours.ColourName AS Colour FROM datasource1.apple JOIN datasource1.AppleColours ON datasource1.apple.ColourID = datasource1.AppleColours.ColourID UNION SELECT datasource2.AppleName AS Apple, datasource2.AppleColours.ColourId AS ColourId, datasource2.AppleColours.ColourName AS Colour FROM datasource2.apple JOIN datasource2.AppleColours ON datasource2.apple.ColourID = datasource2.AppleColours.ColourID) MergedSources – Musti Dec 02 '22 at 10:31
  • ---part 2 ( SELECT DISTINCT ColourName AS Colour FROM datasource1.AppleColours UNION SELECT DISTINCT ColourName AS Colour FROM datasource2.AppleColours) AppleColours ON AppleColours.Colour = MergedSources.Colour – Musti Dec 02 '22 at 10:32
  • No you have both apples' and corresponding colors from both sources. Next step you can create a separate table for colors using again the UNION clause between 'AppleColor' tables – Musti Dec 02 '22 at 10:35
  • CREATE TABLE AppleCOlorMerged ( ID INT NOT NULL IDENTITY(1,1), ColorName NVARCHAR(10)) INSERT INTO AppleCOlorMerged (ColorName) SELECT DISTINCT ColourName AS ColorName FROM datasource1.AppleColours UNION SELECT DISTINCT ColourName AS ColorName FROM datasource2.AppleColours – Musti Dec 02 '22 at 10:35
  • Next step is to replace AppleColours.Colour from the first query with the Id's from the newly created table - AppleColorMerged – Musti Dec 02 '22 at 10:42
0

Maybe including a subquery making an UNION (you will get only one of each), but your problem still will be that, as each table has its own relationship with colours and you are joining both, same item can give you two different colours. My proposal: to make a switch to choose only one, If first is null, choose second, and if second is also null, a default value (some colour code). Other options are to use the lower id, because it was early created, or higher because it was the last...

Something like that

SELECT datasource1.*, datasource2.*, Q.Name, Q.Value FROM datasource1.apple 
JOIN datasource2.apple on datasource1.apple.id = datasource2.apple.id
JOIN
(SELECT ColourID, Name, Value FROM datasource1.AppleColours UNION SELECT ColourID, Name, Value FROM datasource2.AppleColours) Q
ON Q.ColourID = COALESCE(datasource1.ColourID, datasource2.ColourID, {DefaultColor})
0

Are the two data sources supposed to represent slices of the same real population?

I.e., if full joining datasource1 with datasource2 on apple.id is logically consistent, then full joining AppleColours between the 2 datasources should be logically correct as well.

The one-to-many then logically preserves the information from the two datasets, and remains correctly one-to-many. If there are any relationships cardinality violations as the results of this join - those weren't the right cardinalities to begin with.

(btw, should be a full join)

Muppet Bear
  • 148
  • 8