
I'm new to PDI, I'm using PDI 7. I have an Excel input with 6 rows and want to insert it into a Postgres DB. My transformation is: Excel Input --> Postgres Bulk Loader (2 steps only).

Condition 1: When I run the transformation, the Postgres Bulk Loader never stops and nothing is inserted into my Postgres DB.

Condition 2: So I added an "Insert/Update" step after the Postgres Bulk Loader, and all the data was inserted into the Postgres DB, which means success, but the bulk loader is still running.

My transformation

From all the sources I could find, they only need the input and the Bulk Loader step, and after the transformation finishes, the bulk loader is "Finished" (mine stays "Running"). So I want to ask how to do this properly for Postgres. Did I skip something important? Thanks.

blackgee

4 Answers


The PostgreSQL bulk loader used to be only experimental. Haven't tried it in some time. Are you sure you need it? If you're loading from Excel, it's unlikely you'll have enough rows to warrant use of a bulk loader.

Try just the regular Table Output step. If you're only inserting, you shouldn't need the Insert/Update step either.

Brian.D.Myers
  • I don't know exactly whether I need the bulk loader or not; what I know is that _Insert/Update_ took too long (150k rows = 27-33 mins), while with _Table Output_ it took only 1 sec, but that only works when I need to do inserts only. For updates/deletes it's not useful. In the near future I might need to do updates or deletes too, so can you give me any solution? Thanks. – blackgee Jan 25 '17 at 07:17
  • For update/delete check out the `Merge Rows (diff)` step. Your `Insert/Update` step is probably slow because the underlying table is missing an index. – Brian.D.Myers Jan 25 '17 at 17:34
  • _Merge Rows (diff)_ combined with _Table Output_ seems OK. Still, I'm wondering how to make Insert/Update faster; by the way, I had already set an index when the problem occurred. Thanks Brian. – blackgee Jan 26 '17 at 02:58
  • `Merge Rows (diff)` should not be combined with `Table Output`; it should be combined with `Synchronize after merge`. The goal is to load only the rows that have changed. Each row is marked with "new", "changed", "deleted" or "identical". If the number of identical rows is large and the others are small, it should be pretty fast. – Brian.D.Myers Jan 26 '17 at 19:40
  • True, _Merge Rows (diff), Sync after merge, sort, filter: dummy (false), table output (true)_ seems OK; it's definitely faster than _Merge Rows (diff), Sync after merge, sort, filter: dummy (false), insert/update (true)_. Thanks. – blackgee Jan 27 '17 at 02:00

To insert just 7 rows you don't need a bulk loader. The bulk loader is designed to load huge amounts of data. It uses the native psql client. The psql client transfers data much faster since it uses all the features of the binary protocol without the restrictions of the JDBC specification. JDBC is used in other steps like Table Output. Most of the time Table Output is sufficient.

The Postgres Bulk Loader step just builds the incoming rows into CSV-formatted data in memory and passes them to the psql client.
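
For illustration only: the same COPY-from-CSV mechanism can be driven from Java through the PostgreSQL JDBC driver's CopyManager. This is a minimal sketch, not what the PDI step does internally (the step hands the data to a psql process); the connection details and table/column names are made up.

```java
import java.io.StringReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CopyCsvExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details -- adjust to your environment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {

            // A few CSV rows kept in memory, similar to the CSV block the
            // Bulk Loader builds before handing it over to psql.
            String csv = "1,alice\n2,bob\n3,carol\n";

            // The cast assumes a plain driver connection (not wrapped by a pool).
            CopyManager copyManager = new CopyManager((BaseConnection) conn);
            long rows = copyManager.copyIn(
                    "COPY my_table (id, name) FROM STDIN WITH (FORMAT csv)",
                    new StringReader(csv));
            System.out.println("Rows copied: " + rows);
        }
    }
}
```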

simar
  • Hi, Simar. My actual data is 350k rows; the 7 rows were only to simulate how it works. The result with both data sets (350k rows and 7 rows) is the same: nothing is inserted and the bulk loader step stays in _running_ status. **The Postgres Bulk Loader step just builds the incoming rows into CSV-formatted data in memory and passes them to the psql client.** It passes them to the psql client, but why is nothing inserted? Did I miss something? – blackgee Jan 26 '17 at 03:03
  • I normally have transformations that insert over a million rows. It takes some time. The database and the PDI hosts are in the same data center. DB is Postgres, 3.2 mln rows in total. The transformation loads data concurrently in 4 threads; it takes about 30 minutes. – simar Jan 26 '17 at 07:22
  • Another example: PDI and the DB on the same host, 90_000 rows from a CSV file to Postgres. Total time around 4 s. – simar Jan 26 '17 at 07:26
  • 90k rows from CSV to Postgres with inserts in 4 s is fast. Compare my 350k rows, 3 threads, 20k rows committed, from CSV to Postgres with insert = 33 mins. Even your million-row insert with 4 threads = 30 mins is faster. Oh, I forgot to mention: I tried replacing the Insert/Update step with a Dummy step and it took less than 1 sec, and as I mentioned before, Table Output also took less than 1 sec. The bottleneck is definitely the Insert/Update step. Is there any way to make it even slightly faster than 33 mins? Even 5 mins is not acceptable; I need it to work in less than 5 mins. – blackgee Jan 27 '17 at 02:07
  • Why you should avoid using the Insert/Update step -> http://www.graymatter.co.in/blog/pdi-best-practices-why-avoid-insert-update-step – simar Feb 15 '17 at 09:23

I did some experiments.

Environment:

  • DB: Postgres v9.5 x64
  • PDI Kettle v5.2.0
  • PDI Kettle default JVM settings, 512 MB
  • Data source: DBF file, over 2_215_000 rows
  • Both PDI and the DB on the same localhost
  • Table truncated on each run
  • PDI Kettle restarted on each run (to avoid heavy CPU load from GC runs due to the huge amount of rows)

The results are below to help you make a decision.

  1. Bulk loader: average over 150_000 rows per second, around 13-15 s total

  2. Table output (SQL inserts): average 11_500 rows per second. Total is around 3 min 18 s

  3. Table output (batch inserts, batch size 10_000): average 28_000 rows per second. Total is around 1 min 30 s

  4. Table output (batch inserts in 5 threads, batch size 3_000): average 7_600 rows per second per thread, which means around 37_000 rows per second. Total time is around 59 s.

The advantage of the bulk loader is that it doesn't fill the JVM's memory; all data is streamed into the psql process immediately.

Table Output fills the JVM memory with data. After around 1_600_000 rows the memory is full and GC kicks in. The CPU is then loaded up to 100% and the speed slows down significantly. That is why it's worth playing with the batch size, to find the value which gives the best performance (bigger is better), but at some point it causes GC overhead.
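
To make the batch-insert numbers more concrete, here is a rough JDBC sketch of what batched inserts look like at that level (connection details, table and column names are hypothetical; Table Output's real implementation lives inside Kettle):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertExample {
    public static void main(String[] args) throws Exception {
        int batchSize = 10_000; // same order of magnitude as in the experiments above

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            conn.setAutoCommit(false);

            String sql = "INSERT INTO my_table (id, name) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 1; i <= 2_215_000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "row-" + i);
                    ps.addBatch();

                    // Send a full batch to the server and free the buffered rows.
                    if (i % batchSize == 0) {
                        ps.executeBatch();
                    }
                }
                ps.executeBatch(); // flush the last partial batch
            }
            conn.commit();
        }
    }
}
```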

Last experiment: the memory given to the JVM is enough to hold the data. This can be tweaked via the PENTAHO_DI_JAVA_OPTIONS variable. I set the JVM heap size to 1024 MB and increased the batch size.

  1. Table output (batch inserts in 5 threads, batch size 10_000): average 12_500 rows per second per thread, which means around 60_000 rows per second in total. Total time is around 35 s.

Now it is much easier to make a decision. But you have to keep in mind that here Kettle PDI and the database are located on the same host. If the hosts are different, network bandwidth can play some role in performance.


simar
  • This is a really useful experiment. The bulk loader is a good option for larger data, but back to my post: when I use the bulk loader it just runs forever and inserts nothing, even though it reads all rows. Can you give me a screenshot of your experiment steps with the bulk loader? My steps (Excel Input and Postgres Bulk Loader) are unsuccessful. I wonder if I need another step after those two. Thanks Simar. – blackgee Jan 27 '17 at 02:17

Slow Insert/Update step. Why should you avoid the Insert/Update step (when you process a huge amount of data or are limited by time)?

Let's look at the documentation:

The Insert/Update step first looks up a row in a table using one or more lookup keys. If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.

As stated above, for each row in the stream the step will execute 2 queries: the lookup first, and then an update or insert. The source of PDI Kettle shows that a PreparedStatement is used for all queries: insert, update and lookup.
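
As an illustration of that per-row pattern (not the actual Kettle code; connection details, table and key names are made up), the work per incoming row looks roughly like this in JDBC:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Map;

public class InsertUpdatePattern {
    public static void main(String[] args) throws Exception {
        // Hypothetical target: my_table(id integer PRIMARY KEY, name text)
        Map<Integer, String> incomingRows = Map.of(1, "alice", 2, "bob");

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             // The step prepares its statements once and reuses them for every row.
             PreparedStatement lookup = conn.prepareStatement(
                     "SELECT name FROM my_table WHERE id = ?");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO my_table (id, name) VALUES (?, ?)");
             PreparedStatement update = conn.prepareStatement(
                     "UPDATE my_table SET name = ? WHERE id = ?")) {

            for (Map.Entry<Integer, String> row : incomingRows.entrySet()) {
                // 1st round trip: look the row up by its key.
                lookup.setInt(1, row.getKey());
                try (ResultSet rs = lookup.executeQuery()) {
                    if (!rs.next()) {
                        // Not found: 2nd round trip is an INSERT.
                        insert.setInt(1, row.getKey());
                        insert.setString(2, row.getValue());
                        insert.executeUpdate();
                    } else if (!row.getValue().equals(rs.getString("name"))) {
                        // Found but fields differ: 2nd round trip is an UPDATE.
                        update.setString(1, row.getValue());
                        update.setInt(2, row.getKey());
                        update.executeUpdate();
                    }
                    // Found and identical: nothing is done.
                }
            }
        }
    }
}
```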

So if this step is the bottleneck, try to figure out what exactly is slow.

  • Is the lookup slow? (Run the lookup query manually on the database on sample data. Check whether it is slow. Do the columns used to find the corresponding row in the database have an index?)
  • Is the update slow? (Run the update query manually on the database on sample data. Check whether it is slow. Does the update's WHERE clause use an index on the lookup fields?) A way to check this from JDBC is sketched below.
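
A small sketch of that check: run EXPLAIN ANALYZE on the lookup query over JDBC and look at the plan (query text and column names are hypothetical; a "Seq Scan" on a large table usually means the lookup key has no usable index):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExplainLookup {
    // Print the plan of the lookup query on sample data.
    static void explainLookup(Connection conn) throws Exception {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "EXPLAIN ANALYZE SELECT name FROM my_table WHERE id = 12345")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```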

Anyway, this step is slow since it requires a lot of network communication and data processing in Kettle.

The only way to make it faster is to load all the data into a "temp" table in the database and call a function which will upsert the data. Or just use a simple SQL step in the job to do the same.
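
A hedged sketch of that approach, assuming PostgreSQL 9.5+ for INSERT ... ON CONFLICT and made-up table/file names (in PDI you would typically run the SQL parts from an Execute SQL script step or a SQL job entry):

```java
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class StageAndUpsert {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
            conn.setAutoCommit(false);

            // 1. Staging table with the same shape as the target (names hypothetical).
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TEMP TABLE my_table_stage (LIKE my_table)");
            }

            // 2. Bulk-load the whole file into the staging table in one stream.
            new CopyManager((BaseConnection) conn).copyIn(
                    "COPY my_table_stage FROM STDIN WITH (FORMAT csv)",
                    new FileReader("data.csv")); // hypothetical input file

            // 3. One set-based upsert instead of two queries per row.
            try (Statement st = conn.createStatement()) {
                st.execute(
                        "INSERT INTO my_table (id, name) " +
                        "SELECT id, name FROM my_table_stage " +
                        "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name");
            }
            conn.commit();
        }
    }
}
```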

simar