
I am using the ExecuteSQLRecord processor to dump the contents of a large table (100 GB) with 100+ million records.

I have set up the properties as shown below. However, I am noticing that it takes a good 45 minutes before any flow files come out of this processor.

What am I missing?

[screenshot: properties of ExecuteSQLRecord]

I am on NiFi 1.9.1

Thank you.

Vijay Kumar
    You might want to try GenerateTableFetch -> ExecuteSQLRecord, this way the first processor calculates the "fetch queries", and then ExecuteSQLRecord will work on the smaller fetch queries. – Bryan Bende Nov 12 '19 at 20:34
  • Thanks Bryan, I will give it a try. – Vijay Kumar Nov 12 '19 at 22:09
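A minimal sketch of the paging idea behind the suggested GenerateTableFetch -> ExecuteSQLRecord flow (plain Python; the `id` ordering column, table name, and sizes are hypothetical, and the exact SQL dialect GenerateTableFetch emits depends on the configured Database Type): the first processor generates many small fetch queries, so each downstream execution returns a bounded page instead of the whole 100+ million row table.

```python
# Sketch: split one huge table read into many bounded "fetch queries",
# conceptually what GenerateTableFetch does for ExecuteSQLRecord.
# Table name, page size, and ordering column are made-up examples.
def fetch_queries(table, page_size, total_rows):
    for offset in range(0, total_rows, page_size):
        yield (f"SELECT * FROM {table} ORDER BY id "
               f"OFFSET {offset} ROWS FETCH NEXT {page_size} ROWS ONLY")

queries = list(fetch_queries("big_table", 10_000, 100_000_000))
print(len(queries))    # 10,000 small queries instead of one giant one
print(queries[0])
```

Each generated query becomes its own flow file, so downstream processors start emitting results as soon as the first page completes rather than after the entire table is read.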

1 Answer


An alternative to ExecuteSQL(Record), or even GenerateTableFetch -> ExecuteSQL(Record), is to use QueryDatabaseTable without a Maximum-value Column. It has a Fetch Size property that attempts to set the number of rows returned on each pull from the database. Oracle's default is 10, for example, so with 10000 rows per flow file, ExecuteSQL has to make 1000 trips to the DB, fetching 10 rows at a time. As a general rule, I recommend setting Fetch Size equal to Max Rows Per Flow File, so only one fetch is made per outgoing flow file.
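The round-trip arithmetic above works out as follows (a quick back-of-the-envelope check in plain Python; the figures of 10 and 10000 come straight from the answer):

```python
rows_per_flowfile = 10_000     # Max Rows Per Flow File
oracle_default_fetch = 10      # Oracle JDBC driver's default fetch size
tuned_fetch = 10_000           # Fetch Size set equal to Max Rows Per Flow File

# Network round trips to the database needed per outgoing flow file
trips_default = rows_per_flowfile // oracle_default_fetch
trips_tuned = rows_per_flowfile // tuned_fetch

print(trips_default)  # 1000 round trips with the driver default
print(trips_tuned)    # a single fetch per flow file when tuned
```

With 100+ million rows, that per-flow-file difference compounds into millions of avoided round trips, which is why tuning Fetch Size matters far more than it first appears.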

The Fetch Size property should be available to the ExecuteSQL processors as well; I wrote up Apache Jira NIFI-6865 to cover this improvement.

mattyb
  • One of my requirements is to have the output as JSON, so I need to use the QueryDatabaseTableRecord processor. However, I am still seeing the initial delay with that. I set both Fetch Size and Max Rows Per Flow File to 10000, and I set Output Batch Size to 1 so it can emit a flow file as soon as one is ready. – Vijay Kumar Nov 12 '19 at 22:37