ETL choice: build an ETL that works through the SQL query engine (Impala) or against the native database directly?

Question

I am trying to build an ETL that maps the source tables to a dimensional, star-schema model.

Our data warehouse is basically Impala on top of a Kudu database.

My question is, should I:

A- build an ETL that works with the Kudu tables directly using Python (link)

or

B- create UDFs (equivalent to stored procedures in SQL) in Impala that do the inserts, joins, etc. to map the source tables to the star-schema model, and schedule them with NiFi or another scheduler such as Airflow?

In my opinion, it would be better to work with the native database rather than with the SQL engine on top of it, but that is just an assumption.

Koushik Roy · Accepted Answer · 2021-04-18T18:04:04.843

Why not approach C :) a bit of both?

Both have pros and cons.

A - Use Python to build the ETL.
    Pros: better control, and the flexibility to implement any logic you want.
    Cons: you have to write code in both Python and SQL. If something fails, root cause analysis (RCA) will be a nightmare, and maintenance may be harder by comparison. Performance will also be poorer on very large data volumes.
B - Use SQL to fetch and load the data directly.
    Pros: faster performance and less code.
    Cons: complex logic is difficult to implement, and maintaining the code and the schedule may be hard.

In addition to the above, please consider how comfortable you and your team are with Python and SQL, and how maintainable the result will be.
Currently we are using approach B in my Cloudera project. We create views and then use INSERT to load the final tables directly. We hardly need any UDFs.
So my recommendation is to use approach B, and to use approach A only when the logic is genuinely too complex to express in SQL.
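As a rough illustration of the "view, then insert" pattern, here is a minimal sketch. It assumes SQLite as a stand-in for Impala so that it runs anywhere, and every table, view, and column name in it is hypothetical. SQLite has no INSERT OVERWRITE, so the full refresh is emulated with DELETE followed by INSERT ... SELECT.

```python
import sqlite3

# SQLite stands in for Impala here so the sketch is runnable as-is.
# All table, view, and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_customer (id INTEGER, first_name TEXT, last_name TEXT, country TEXT);
    INSERT INTO src_customer VALUES
        (1, 'Ada', 'Lovelace', 'uk'),
        (2, 'Alan', 'Turing', 'uk');

    -- Dimension table of the star schema.
    CREATE TABLE dim_customer (customer_id INTEGER, full_name TEXT, country_code TEXT);

    -- The view holds the whole source-to-dimension mapping.
    CREATE VIEW dim_customer_vw AS
    SELECT id AS customer_id,
           first_name || ' ' || last_name AS full_name,
           UPPER(country) AS country_code
    FROM src_customer;
""")


def load_dim_customer(conn):
    """Fully refresh the dimension table from its view."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM dim_customer")
        conn.execute("INSERT INTO dim_customer SELECT * FROM dim_customer_vw")


load_dim_customer(conn)
rows = conn.execute("SELECT * FROM dim_customer ORDER BY customer_id").fetchall()
print(rows)
```

Because the mapping logic lives in the view, the load step stays a single generic statement, and changing the mapping means changing only the view definition.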

EDIT: Let's say we have to load the orders table. We execute the following blocks to load orders and the dependent org, cust, and prod tables.

Load customer    |
Load org         | --> Load Orders final.
Load product     |
Load order stage |

The Load customer block is a collection of scripts like:

insert overwrite cust_stg select * from cust_stg_vw; -- This loads into stage table
insert overwrite cust select * from cust_vw; -- This loads into cust table

The other blocks are written in the same way. Grouping the scripts into blocks gives us the flexibility to run them in whatever order and position we want, which lets us tune performance.
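The block layout above could be driven from Python roughly as follows. This is only a sketch under assumptions: the customer block is copied from the answer, while the statements for the other blocks and the final load use made-up table and view names, and `execute` is a hypothetical callable that would send one statement to Impala (for example, a wrapper around a database cursor). Here it is replaced by a dry run that just records the statements.

```python
from concurrent.futures import ThreadPoolExecutor

# Each block is an ordered list of statements. The customer block comes from
# the answer; the other blocks are hypothetical placeholders with made-up names.
BLOCKS = {
    "customer": [
        "insert overwrite cust_stg select * from cust_stg_vw",
        "insert overwrite cust select * from cust_vw",
    ],
    "org": ["insert overwrite org select * from org_vw"],
    "product": ["insert overwrite prod select * from prod_vw"],
    "order_stage": ["insert overwrite orders_stg select * from orders_stg_vw"],
}
FINAL_BLOCK = ["insert overwrite orders select * from orders_vw"]


def run_block(statements, execute):
    """Run one block's statements sequentially, in the order listed."""
    for sql in statements:
        execute(sql)


def load_orders(execute, max_workers=4):
    """Run the independent blocks in parallel, then the final orders load."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_block, stmts, execute) for stmts in BLOCKS.values()]
        for future in futures:
            future.result()  # re-raises a block failure, so the final load is skipped
    run_block(FINAL_BLOCK, execute)


# Dry run: record the statements instead of sending them to Impala.
executed = []
load_orders(executed.append)
print(executed[-1])
```

Statements inside a block keep their order, the four blocks may interleave with each other, and the final load starts only after every block has finished without error. With a real database client, each block would probably need its own connection rather than a shared one.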

Sounds logical. However, with the second approach ("We create views and then use insert to load final tables directly."), how did you schedule this? Does it run every x units of time, or does it run with each record inserted? – Atheer Abdullatif, Apr 18 '21 at 10:40
We put the elements in groups/blocks, chain them (in parallel or sequentially), and then run them using a scheduler tool called Rundeck. It's basically a Unix script scheduler that can create jobs however we want. I updated the answer with more details. – Koushik Roy, Apr 18 '21 at 18:05
