
I am using QueryDatabaseTable to read from a PDA/Netezza table that has about 70 million rows.

NiFi is set up as a single-node instance with bootstrap.conf edited to give it 16g of RAM.

QueryDatabaseTable is using defaults except for:

  • Max Rows Per Flow File: 1000000
  • Output Batch Size: 10

The result is that it takes quite a long time: 1.5 hours for the full 70M rows, or approx. 1M records per minute.

  1. Are there other config changes I can experiment with to improve performance?
  2. QueryDatabaseTable converts internally to Avro, is it possible to skip this conversion and just use text format from PDA/Netezza instead to possibly increase performance?

I'm also planning to set up a cluster soon, but that will only be 4 or 6 nodes, so I'm still looking to get the run time down to 10 minutes or less. Is this possible?

bp2010

2 Answers


QueryDatabaseTable is meant to run on the Primary Node only as it is a single-source for fetching. This means it will not scale to a distributed/parallel solution such as Sqoop. Also if you hypothetically have 3 nodes in a NiFi cluster but 10 in a Hadoop cluster with Sqoop, naturally you'll get more parallelism in the latter.

However, NiFi has the GenerateTableFetch -> ExecuteSQL pattern for this. Instead of a single processor on a single node doing the full fetch, GenerateTableFetch will instead generate a number of flow files that each contain a SQL statement to fetch a "page" of data from the table. You can then use ExecuteSQL to actually fetch the rows.

GenerateTableFetch still runs only on the primary node, but it does not fetch the rows itself; instead you'd distribute the flow files among the cluster nodes using either a Remote Process Group -> Input Port on the same cluster, or in recent versions of NiFi you can use a Load-Balanced Connection between GenerateTableFetch and ExecuteSQL.

Once the flow files are distributed among the cluster, each node can run ExecuteSQL on them in parallel and fetch a page of data at a time for downstream processing.
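Conceptually, the paging GenerateTableFetch performs can be sketched in a few lines of Python. This is only an illustration of the idea, not NiFi's actual implementation: the table name, the `id` ordering column, and the `LIMIT`/`OFFSET` paging syntax are assumptions here (the real processor uses a database-adapter-specific paging clause and a configured Maximum-value Column):

```python
def generate_fetch_statements(table, columns, partition_size, row_count):
    """Mimic what GenerateTableFetch does: emit one SQL statement per page,
    so each page can be fetched independently (and in parallel) by ExecuteSQL."""
    statements = []
    for offset in range(0, row_count, partition_size):
        statements.append(
            f"SELECT {columns} FROM {table} "
            f"ORDER BY id LIMIT {partition_size} OFFSET {offset}"
        )
    return statements

# 70M rows in 1M-row pages -> 70 flow files, each carrying one paged query
pages = generate_fetch_statements("big_table", "*", 1_000_000, 70_000_000)
```

With 70 such flow files load-balanced across the cluster, each node works through its share of pages concurrently instead of one node streaming the whole table.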

For the output format, as of NiFi 1.8.0 there is ExecuteSQLRecord which allows you to output the rows in any format that has a RecordSetWriter, which includes Avro, JSON, CSV, XML, Free-text (for custom textual formats), and you can even script your own for more complex, proprietary, or currently unsupported formats. For completeness there is also a QueryDatabaseTableRecord processor, but for this answer I'd like to discourage you from using it to solve your use case :)

mattyb
  • Thanks for the detailed response @mattyb. Once I set up a cluster I will experiment with the GenerateTableFetch -> ExecuteSQL pattern. I will also try using `ExecuteSQLRecord` with text output to see if it improves perf. Also, would an increase in RAM significantly improve speeds for QDT processor? (from 16g to 32g ??) – bp2010 Oct 04 '19 at 06:45

In my opinion 1M rows per minute is waaaay too slow for a Netezza system.

  • External tables/unload should be 50-100 times faster
  • A single SELECT through ODBC to a fast client (I have used PowerCenter and SAS): 5-10 times faster
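For reference, a Netezza external-table unload looks roughly like the sketch below. The table name and output path are hypothetical, and the exact `USING` options depend on your environment (e.g. `REMOTESOURCE 'ODBC'` when writing to a file on the client side):

```sql
-- Sketch only: unload big_table to a delimited file via a transient external table
CREATE EXTERNAL TABLE '/tmp/big_table_unload.csv'
  USING (DELIMITER ',' REMOTESOURCE 'ODBC')
AS SELECT * FROM big_table;
```

This pushes the export work onto the Netezza host itself, which is why it can be one to two orders of magnitude faster than row-by-row fetching through JDBC.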

Please note that the minimum query time on Netezza is close to 1/8 of a second, which means you need to get full value for 'money' in each query (quite possible btw), and I suspect that your framework tries to 'over-optimize' this behind the scenes :)

Lars G Olsen
  • 1M records p/minute – bp2010 Dec 19 '19 at 12:29
  • Oh - sorry - typo on my part, but my statement stands, I’ll edit my answer to say ‘minute’ – Lars G Olsen Dec 20 '19 at 15:49
  • Using an external table on a local disk on the NZ system I can get 100 mill rows out in a minute (150 bytes record size). If I use the ODBC clause and send the data to a server on 10G network I get half (50 mill rows per minute) – Lars G Olsen Jan 03 '20 at 20:14