Why not approach C, :) a bit of both.
Both has pros and cons.
A - use python to build ETL -
pros - better control, flexible to do any logic you want.
cons - you have to code in python and code in sql. If something fails, it will be a nightmare to do RCA. Maintenance may be harder in comparison.
- performance wise, this approach will be poorer in case of huge volume of data.
B - Use SQL to fetch data directly -
pros - faster performance. less coding.
cons - difficult to implement complex logic. Maintenance of code and schedule may be hard.
In addition to above, pls consider, your/teams comfort on python/SQL and future maintainability.
Currently we are using approach B in my cloudera project. We create views and then use insert
to load final tables directly. We hardly need any UDF.
Now, my recommendation, please use approach B. And use approach A only in case you really can not create complex logic.
EDIT :
Lets say, we have to load orders table. So we execute following blocks to load orders and dependent org,cust,prod tables.
Load customer |
load org | --> Load Orders final.
load product |
load order stage|
Load customer block is collection of scripts like-
insert overwrite cust_stg select * from cust_stg_vw; -- This loads into stage table
insert overwrite cust select * from cust_vw; -- This loads into cust table
And similarly other blocks are written. Putting them in blocks gives us flexibility to put them in any order/anywhere we want to improve performance.