Improving the Speed of Filter Rows in Pentaho

Question

I am trying to filter out records from an excel sheet with over 94000 records with a simple validation and getting them in a new Excel sheet using Pentaho . But the speed of reading the input file and filtering of the record reduces gradually to less than 50 r/s after passing 20000 records.

Is there a way to increase the speed of processing the records or maintain the initial speed which was 1000 r/s ?

score 2 · Answer 1 · answered Dec 06 '17 at 23:58

2

Place a dummy step in the place of Excel Output step and run the transformation again. If you get better throughput now, you can understand that excel writer is the culprit.

See https://support.pentaho.com/hc/en-us/articles/205715046-Best-Practices-Pentaho-Data-Integration-Performance-Tuning- to know more about performance tuning in Pentaho etl.

answered Dec 06 '17 at 23:58

Siby Abin

41
3

Thanks for the link – Enthusiastic Jan 09 '18 at 09:49

score 0 · Answer 2 · answered Nov 30 '17 at 10:09

I sniff the bottleneck is in the Excel Input step rather than the Filter step.

As a matter of fact, the Excel Input is very, very slow. Reason why I am using CSV Input each time it is possible.

The Filter step is quick, speed higher than a few thousands are common. IN your case, this step spends its time waiting to get rows from the Excel Input step, rather than in working. That's explains the speed is 49 row/s, not far from the 60 rows/s of the Input Excel.

The fact that the process slows down after x rows, is an indication that the memory is full and the JVM spends its time in disk swapping. Try to increase the memory size on the spoon.bat/spoon.sh. [set PENTAHO_DI_JAVA_OPTIONS="-Xms1024m" "-Xmx4096m" "-XX:MaxPermSize=256m"]

Something else you can try is to adjust the number of rows the PDI keeps in each steps. Click anywhere, Properties, Miscellaneous, Number of rows in row set. Reduce it until you find the right balance, between the size of the batch read by the Excel Input and total number of records kept in memory.

But the best is to avoid Excel 2007 XLSX Spredsheet type.

No, the bottleneck is in the Excel output. Look at the input/output buffers. Input step is waiting for the buffer to empty out. Excel output has 10k rows waiting to be written. — nsousa, Dec 02 '17 at 08:38

score 0 · Answer 3 · answered Dec 02 '17 at 08:40

Your bottleneck is in writting to Excel, as can be easily seen from the 10k rows waiting in its input buffer.

Writting to excel is known to be slow. You may want to try the Excel writer step instead, but don’t expect miracles.

Speed of a step is given by “number of rows passed through divided by total time”. If a downstream step is slow this will decrease with time. You can see it by disabling the output step and you’ll see the first two steps are much faster.

score 0 · Accepted Answer · answered Dec 15 '17 at 21:34

I think the best way to solve this is split the current transformation in 2 transformations and 1 job. The first transformation will Input the Excel Rows and filter, then use a Text File Output step as a "temporary" result. In the next transformation, read the CSV file created previously, and export it to excel with Excel Writer. Having the rows in plain-text CSV is faster than reading excel. Excel wirting is extremly limited in spoon. Have fun.

Improving the Speed of Filter Rows in Pentaho

4 Answers4