0

I am using Pentaho Data Integration to do an SCD type 1 transformation. I am using the Combination lookup/update step to generate the surrogate key value (upon insert). The commit size is 100000 and the cache size is 99999. My source table has 19763 rows, but when I run the job to load data into the destination (dimension) table, the Combination lookup/update step only processes 10000 of the 19763 rows every single time.

How can I get it to process all 19763 records in the source table?

Kuldip.Das
  • If the Lookup/Update steps and/or the Database Lookup are accessing the same table, they surely lock each other. What do you want to achieve? – AlainD May 23 '18 at 15:08
  • The second step - Table input - has 19763 records, which I want to load into my destination dimension table. But when the process reaches Combination lookup/update, the total number of records it processes is 10002. Is it because of the locking? – Kuldip.Das May 24 '18 at 09:17

2 Answers

1

Finally! I found the answer. It's simple. Click Edit -> Settings -> Miscellaneous -> Nr of rows in rowset and change it from 10000 to a number at least as large as the number of records coming from the source. For me the value was set to 10000, which is why only 10000 records were ever written to my destination dimension table. I changed it to a million and now all 19763 records land in my destination table.
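If you want to apply the same change outside of Spoon, here is a minimal, hedged sketch in Python. It assumes the transformation is saved as a .ktr file and that the rowset size is serialized as a <size_rowset> element under <info>; verify that against your own file, since the tag location is an assumption here, and the path my_transformation.ktr is just a placeholder.

```python
# Sketch: bump the rowset size stored in a Kettle .ktr file.
# Assumption: the setting lives at /transformation/info/size_rowset
# (check your own .ktr before relying on this).
import xml.etree.ElementTree as ET

KTR_PATH = "my_transformation.ktr"   # placeholder path
NEW_ROWSET_SIZE = "1000000"          # comfortably above the 19763 source rows

tree = ET.parse(KTR_PATH)
node = tree.getroot().find("./info/size_rowset")
if node is None:
    raise SystemExit("size_rowset element not found - check the .ktr layout")

print(f"Current rowset size: {node.text}")
node.text = NEW_ROWSET_SIZE
tree.write(KTR_PATH, encoding="utf-8", xml_declaration=True)
print(f"Rowset size set to {NEW_ROWSET_SIZE}")
```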

Kuldip.Das
  • At risk of repeating myself, this does NOT solve the issue, only hides it. In my opinion, the Database Lookup (which is rather slow, by the way) is accessing the same table as the Insert/Update. The fact that it works with a larger rowset goes in favor of this interpretation. – AlainD May 25 '18 at 09:20
1

It seems you are doing an incremental update. There is a special step, named Merge Rows (Diff), that compares two streams and tells you whether rows exist in both streams and whether they have changed.

The two streams, a reference stream (the current data) and a compare stream (the new data), are merged. Each row is merged and flagged as:

  • identical: the key was found in both streams and the values to compare are identical;
  • changed: the key was found in both streams but one or more values differ;
  • new: the key was not found in the reference stream;
  • deleted: the key was not found in the compare stream.

The two streams must be sorted before they are merged. You can do this in the SQL query, but it is best to add an explicit Sort rows step, because otherwise the process will stall after reading one rowset's worth of records, i.e. whatever the "Nr of rows in rowset" limit is (seems familiar?). A minimal sketch of the comparison logic follows.
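To make the flagging concrete, here is a minimal Python sketch of what Merge Rows (Diff) does conceptually: walk two streams that are both sorted on the key and flag each row as identical, changed, new or deleted. The field names id and name are placeholders for the example, not anything from the original transformation.

```python
# Conceptual sketch of Merge Rows (Diff): both inputs must be sorted on the key.
def merge_diff(reference, compare, key, fields):
    """Yield (flag, row) pairs; flag is identical/changed/new/deleted."""
    ref, cmp_ = iter(reference), iter(compare)
    r, c = next(ref, None), next(cmp_, None)
    while r is not None or c is not None:
        if c is None or (r is not None and r[key] < c[key]):
            yield "deleted", r            # key only in the reference stream
            r = next(ref, None)
        elif r is None or c[key] < r[key]:
            yield "new", c                # key only in the compare stream
            c = next(cmp_, None)
        else:                             # same key in both streams
            if all(r[f] == c[f] for f in fields):
                yield "identical", c
            else:
                yield "changed", c
            r, c = next(ref, None), next(cmp_, None)

# Tiny usage example with placeholder data (both lists sorted on "id"):
current  = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
incoming = [{"id": 2, "name": "Bobby"}, {"id": 3, "name": "Cat"}]
for flag, row in merge_diff(current, incoming, key="id", fields=["name"]):
    print(flag, row)
# -> deleted id 1, changed id 2, new id 3
```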

The stream is then directed with a Switch/Case step to the appropriate action.

The "best pratice" pattern is as follows, in which I added a first step to get the max date, and a step to built the surrogate key.

[Screenshot: the transformation layout for this pattern]

This pattern has proven over the years to be much faster. In fact, it avoids the very slow Database lookup (one full database search per row) and reduces the use of the slow Insert/Update step (three database accesses per row: one to fetch the record, one to change the values and one to store it). And the sort (which can be prepared upstream in the stream) is quite efficient. A rough sketch of how the flags map to database actions follows.
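For completeness, this is a rough sketch (not the author's actual transformation) of how the Switch/Case routing translates into database work for an SCD type 1 load: new rows become inserts, changed rows become type 1 overwrites, identical rows cost nothing, and deleted rows are typically ignored. It uses sqlite3 and invented table/column names (dim_customer, id, name) purely for illustration.

```python
# Sketch: applying Merge Rows (Diff) flags as SCD type 1 actions.
# Table and column names are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Ann'), (2, 'Bob')")

flagged_rows = [                      # output of the merge/diff stage
    ("identical", {"id": 1, "name": "Ann"}),
    ("changed",   {"id": 2, "name": "Bobby"}),
    ("new",       {"id": 3, "name": "Cat"}),
]

for flag, row in flagged_rows:
    if flag == "new":                 # one INSERT per genuinely new key
        conn.execute("INSERT INTO dim_customer (id, name) VALUES (:id, :name)", row)
    elif flag == "changed":           # one UPDATE per changed key (type 1 overwrite)
        conn.execute("UPDATE dim_customer SET name = :name WHERE id = :id", row)
    # "identical" rows touch the database zero times; "deleted" is ignored here.

conn.commit()
print(conn.execute("SELECT * FROM dim_customer ORDER BY id").fetchall())
# -> [(1, 'Ann'), (2, 'Bobby'), (3, 'Cat')]
```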

AlainD