1

Which approach provides higher performance?

  1. Writing a query using T-SQL, joining tables, then inserting the result into another table

  2. Using Pentaho Spoon's Table Input, then using a Database Lookup step to "join" one table at a time, then inserting the result into another table

The objective is to take a denormalized table, join it with 5 dimension tables on their text columns to retrieve the dimensions' PKs, and then insert the result into a fact table.
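
For concreteness, option 1 would look roughly like this; every table and column name below is made up purely to illustrate the shape of the statement, the real schema will differ:

    -- Option 1 as a single set-based statement (hypothetical names).
    INSERT INTO dbo.FactSales (DateKey, ProductKey, StoreKey, CustomerKey, PromoKey, Amount)
    SELECT d.DateKey, p.ProductKey, s.StoreKey, c.CustomerKey, pr.PromoKey, src.Amount
    FROM dbo.StagingDenormalized AS src
    JOIN dbo.DimDate     AS d  ON d.DateText     = src.DateText
    JOIN dbo.DimProduct  AS p  ON p.ProductName  = src.ProductName
    JOIN dbo.DimStore    AS s  ON s.StoreName    = src.StoreName
    JOIN dbo.DimCustomer AS c  ON c.CustomerName = src.CustomerName
    JOIN dbo.DimPromo    AS pr ON pr.PromoCode   = src.PromoCode;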

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Hikari
  • 3,797
  • 12
  • 47
  • 77

2 Answers

1

Probably better suited for dba.stackexchange.com. But I guess a database engine is going to perform this task much faster, because a) it can optimize access to all tables involved using indexes and table statistics, and b) you get rid of the overhead that an ETL tool and multiple database queries introduce. Pentaho PDI processes rows individually, so for each row coming from your Table Input step you will have an SQL query for every lookup step.
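
To give a rough picture of that per-row overhead (table and column names are made up for illustration): a Database Lookup step effectively issues something like the first query below once per incoming row, while the set-based version touches each dimension table once for the whole load.

    -- What a Database Lookup step effectively runs once per incoming row
    -- (hypothetical names): one round trip per row, per lookup step.
    SELECT ProductKey
    FROM dbo.DimProduct
    WHERE ProductName = ?;   -- JDBC-style parameter bound to the current row's value

    -- The set-based alternative resolves all rows in one pass per dimension:
    SELECT src.*, p.ProductKey
    FROM dbo.StagingDenormalized AS src
    JOIN dbo.DimProduct AS p ON p.ProductName = src.ProductName;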

Dirk Trilsbeek
  • 5,873
  • 2
  • 25
  • 23
  • Thanks. If that's the case, then what's the purpose of Spoon having components like join, group by, etc., that would be done better by a SQL engine? – Hikari Aug 08 '17 at 21:55
  • quite often the streams you want to join or group don't exist in SQL databases. You sometimes start with text files, or even with a table input, but then you start adding more fields, maybe some of them necessary for the join. Not to mention the options an ETL tool gives you in terms of filtering, value mapping etc. You could also design your ETL process to first create join-able tables and then join them in a query, but often outright performance isn't as important as simplicity. It is a judgement call based on your scenario. – Dirk Trilsbeek Aug 09 '17 at 06:02
  • Ah, got it, it's available for data that isn't in RDBs! But is it really best practice to prioritize joining in the RDB instead of Spoon whenever possible? – Hikari Aug 09 '17 at 16:46
  • I join in database queries when it's convenient, but I don't always optimize my ETL transformations towards that. ETL transformations can become quite complex and I tend to focus more on making them easy to understand. It is usually somewhat obvious whether joining at the database level makes sense - is the data already in the database? Or is it such an enormous amount of data that joining in a transformation would take forever? It really depends on your project, the size of your dataset, time constraints and maybe complexity. To simplify: if you can choose freely, choose database joins. – Dirk Trilsbeek Aug 09 '17 at 16:54
  • Thanks a lot for your insights! – Hikari Aug 16 '17 at 21:53
  • I have serious reasons -- and scores of real life use cases -- to totally disagree. See my answer, and never underestimate Kettle efficiency. – AlainD Aug 18 '17 at 12:49
  • And I have scores of real-life use cases saying the exact opposite. You can of course go into a lot more detail and likely find some situation in which Kettle outperforms a database engine, but I'm pretty sure you have to be quite specific in your specifications to come up with one that works in the real world. Try joining a couple of multi-million-row tables in PDI, where a cache won't be much help, where every query will have to go through the database, where each query will be executed for millions of rows. You know, real world stuff. – Dirk Trilsbeek Aug 18 '17 at 15:49
0

It is conventional wisdom to think that SQL outperforms Pentaho PDI on complex queries. That belief rests on the blind assumption that the SQL optimizer always finds a true optimum.

I have a number of counter-examples in which we reduced a query time of more than one hour to a few minutes by extracting the complexity of the SQL query into a series of lookups and filters.

We were better because:

  1. The lookup step expects one matching record per entry, while the SQL optimizer must assume that the join is not unique. Unfolding a star/snowflake schema, as here, is exactly that one-match-per-row case.

  2. The lookup step is really smart: it reads just the data it needs and keeps it in memory, using internal sorted hashtables to speed up subsequent lookups.

  3. The above is especially efficient when the flow is known to be sorted. And while a select from oneTable order by is quick, especially when the table is suitably indexed, the same select from manyJoinedTables where LotsOfConditions order by may be quite inefficient, because the SQL engine cannot count on indexes for the join result (see the sketch just below).
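
A rough sketch of point 3, with made-up table and column names: the first query can stream straight off an index, while the second usually forces the engine to build the join result first and then sort it.

    -- Can stream directly from an index on CustomerName, no extra sort needed:
    SELECT CustomerName, Amount
    FROM dbo.StagingDenormalized
    ORDER BY CustomerName;

    -- Here the ORDER BY applies to the join result, so the engine typically has
    -- to materialize that result and then sort it; indexes on the base tables
    -- help the joins but not the final ordering:
    SELECT c.CustomerKey, p.ProductKey, src.Amount
    FROM dbo.StagingDenormalized AS src
    JOIN dbo.DimProduct  AS p ON p.ProductName  = src.ProductName
    JOIN dbo.DimCustomer AS c ON c.CustomerName = src.CustomerName
    WHERE src.Amount > 0
    ORDER BY src.CustomerName;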

In fact, I suspect the above conditions are exactly the ones the SQL optimizer would like to find and rely on, but usually cannot assume, because it has to stay general.

As a rule of thumb, be confident in the efficiency of PDI. Matt Casters and Jens Bleuel built a very good piece of software, which was tested under volume conditions you cannot even imagine.

So use the solution that is easier to maintain (most of the time, PDI lookups), and if it is really, really too slow, then move the logic into the Table Input's SQL, but do not expect it to be systematically better.

Notes:

  • Avoid the Database Lookup step (its prepared statement uses a cache, but we are in precisely the case where we look up a different key each time).

  • Avoid the join steps, i.e. explicitly tell Kettle it can count on a unique match if you know that is the case. The Join Rows and Merge Join steps are efficient, but only when the incoming flows are sorted.

  • Use filters (to reduce the number of rows) as soon as possible, even in the SQL itself; every rule has its exception (see the sketch after these notes).

  • Don't bother reducing the number of columns with a Select Values step. It has almost no impact on speed! You don't think Kettle naively copies the values from step to step instead of using a clever system of pointers, do you?

  • Computations in a JavaScript step are not as inefficient as the legend says; in fact, PDI is usually far busier with sorting and lookups.

  • Do not spread the aggregates over many Memory Group By steps. Each of these steps needs to read the whole incoming flow before it knows it is finished, so it blocks the steps that follow.

  • Usually a sorted Group By does not improve on the Memory Group By. The one exception is when memory reaches its quota and Java starts launching one garbage collection after another. In that case, a sort can be used to spill data to temporary disk.

  • Avoid intermediate tables. Instead, build the flow by adding columns, and when the data is ready, send it to a Table Output step with a large commit size.
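
As a rough sketch of the "filter early" note above (names and predicate are made up): when the source is already a database, the cheapest filter is the one pushed into the Table Input query itself, so fewer rows ever enter the transformation.

    -- Filter pushed into the Table Input query (hypothetical names): fewer rows
    -- enter the stream, so every later lookup and sort is cheaper.
    SELECT DateText, ProductName, StoreName, CustomerName, PromoCode, Amount
    FROM dbo.StagingDenormalized
    WHERE LoadDate >= '2017-08-01'   -- example predicate; adjust to the real load window
      AND Amount IS NOT NULL;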

AlainD
  • 6,187
  • 3
  • 17
  • 31
  • Why was this answer downvoted? It is a matter of fact. We have two dozen examples in which clever coding with PDI outperformed the standard SQL optimizer. – AlainD Aug 22 '17 at 09:51