
Could anyone explain the difference between these four variables and their co-dependencies when doing a snapshot with Postgres? Are the batch and queue sizes ignored during a snapshot? (A sample config setting all four appears after the list below.)

  • max.batch.size: Maximum size of each batch of events that the connector processes.

  • max.queue.size: Maximum number of records that the blocking queue can hold.

  • snapshot.fetch.size: Maximum number of rows in a batch (queried from the database).

  • incremental.snapshot.chunk.size: Maximum number of rows that the connector fetches and reads into memory during an incremental snapshot.
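
For reference, here is a minimal sketch of how these four properties can be set together when registering the connector through the Kafka Connect REST API. It is an illustration, not a recommendation: the connector name, hostnames, credentials, and topic prefix are placeholders, and the numeric values shown are the documented defaults, which you should verify against your Debezium version (topic.prefix assumes Debezium 2.x; older versions use database.server.name instead).

```python
import json
import requests

# Hedged sketch: registers a Debezium Postgres connector via the Kafka
# Connect REST API. All connection details below are placeholders.
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "inventory",
        "topic.prefix": "inventory",
        # Streaming: events are drained from the internal queue in batches.
        "max.batch.size": "2048",
        # Documented to always be larger than max.batch.size.
        "max.queue.size": "8192",
        # Initial snapshot: rows fetched per database round trip.
        "snapshot.fetch.size": "10240",
        # Incremental snapshot: rows read into memory per chunk.
        "incremental.snapshot.chunk.size": "1024",
    },
}

resp = requests.post(
    "http://localhost:8083/connectors",  # Kafka Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```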

Jonathan Chevalier

1 Answer


Here's an explanation of the four variables and their dependencies when performing a snapshot with Postgres:

  • max.batch.size: This variable determines the maximum size of each batch of events that the Postgres connector processes. It is not typically used when taking a snapshot, because the snapshot operation reads the entire contents of the database and writes it to the replication stream as a single batch. It is relevant when processing live changes to the database, as it determines the size of the batches written to the replication stream.
  • max.queue.size: This variable determines the maximum number of records that the blocking queue can hold. The blocking queue buffers data as it is processed by the connector. It is not typically used when taking a snapshot, because the snapshot operation reads the entire contents of the database and writes it to the replication stream without buffering. It is relevant when processing live changes to the database, as it determines the size of the buffer that holds the data being processed (a toy model of this queue-and-batch interplay follows this list).
  • snapshot.fetch.size: This variable determines the maximum number of rows per batch that the connector fetches from the database when performing a snapshot. It is critical when taking a snapshot, as it controls how much data is read from the database at a time and written to the replication stream. A smaller fetch size means more round trips to the database, while a larger fetch size uses more memory.
  • incremental.snapshot.chunk.size: This variable determines the maximum number of rows that the connector fetches and reads into memory when performing an incremental snapshot. It is critical when taking an incremental snapshot, as it controls how much data is read from the database and buffered in memory before being written to the replication stream. A smaller chunk size means more round trips to the database, while a larger chunk size uses more memory.
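
To make the queue-and-batch interplay concrete, here is a toy model in plain Python, not Debezium's implementation: a bounded blocking queue stands in for max.queue.size and applies back-pressure to the reader, while a poll loop drains at most max.batch.size events per cycle.

```python
import queue
import threading

MAX_QUEUE_SIZE = 8  # stands in for max.queue.size
MAX_BATCH_SIZE = 3  # stands in for max.batch.size; kept smaller than the queue

events = queue.Queue(maxsize=MAX_QUEUE_SIZE)

def reader():
    # Stand-in for the change-event reader: put() blocks when the queue is
    # full, which is how the queue size applies back-pressure upstream.
    for i in range(10):
        events.put(f"event-{i}")
    events.put(None)  # sentinel: no more events

def poll_batch():
    # Drain up to MAX_BATCH_SIZE events, mimicking one poll cycle.
    batch = []
    while len(batch) < MAX_BATCH_SIZE:
        item = events.get()
        if item is None:
            return batch, True
        batch.append(item)
    return batch, False

threading.Thread(target=reader, daemon=True).start()
finished = False
while not finished:
    batch, finished = poll_batch()
    if batch:
        print("delivering batch:", batch)
```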

In summary, when performing a snapshot with Postgres, snapshot.fetch.size and incremental.snapshot.chunk.size are the critical settings for controlling how much data is read from the database at a time and written to the replication stream. max.batch.size and max.queue.size are typically not used during snapshot operations but are relevant when processing live changes to the database. All four variables are interrelated and should be tuned together to achieve good performance without exhausting the system's resources.
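
One documented relationship among them: the Debezium docs state that max.queue.size should always be larger than max.batch.size. A trivial sanity check, written here as a hypothetical helper:

```python
def validate_queue_and_batch(config: dict) -> None:
    # Hypothetical helper, not part of Debezium: the docs state that
    # max.queue.size should always be larger than max.batch.size.
    batch_size = int(config.get("max.batch.size", 2048))
    queue_size = int(config.get("max.queue.size", 8192))
    if queue_size <= batch_size:
        raise ValueError(
            f"max.queue.size ({queue_size}) should be larger than "
            f"max.batch.size ({batch_size})"
        )
```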

ITLoook
  • If I understand well, Debezium will fetch x batches of rows until the chunk size is filled? – Jonathan Chevalier Mar 19 '23 at 13:26
  • During a snapshot operation, the Debezium connector will query the database for data in batches based on the snapshot.fetch.size parameter. Once a batch of data is retrieved, it will be written to the replication stream as a single batch without buffering. During an incremental snapshot operation, Debezium will query the database based on incremental.snapshot.chunk.size. Once a batch of changes is retrieved, it will be buffered in memory until the incremental.snapshot.chunk.size limit is reached and will be written to the replication stream as a single batch. – ITLoook Mar 19 '23 at 15:44
  • When using the words "snapshot operation", are you referring to the initial snapshot? I am asking because I was able to improve the initial snapshot time by a factor of 10 by modifying only the batch and queue sizes, without updating **snapshot.fetch.size** cc: https://stackoverflow.com/questions/55839310/debezium-is-failing-to-snapshot-big-table-size – Jonathan Chevalier Mar 20 '23 at 16:30
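
To picture the chunked reads that the comments above discuss, here is a simplified model of incremental-snapshot chunking: rows are read in primary-key order, at most one chunk at a time, so only incremental.snapshot.chunk.size rows are held in memory at once. Debezium's real algorithm additionally interleaves watermarks with the live change stream to deduplicate events; that part is omitted, and the table and key names are placeholders.

```python
CHUNK_SIZE = 1024  # stands in for incremental.snapshot.chunk.size

def snapshot_chunks(conn, table="inventory.orders", pk="id"):
    # Works with any DB-API connection, e.g. psycopg2.connect(...).
    last_pk = None
    while True:
        with conn.cursor() as cur:
            if last_pk is None:
                # First chunk: start from the beginning of the key range.
                cur.execute(
                    f"SELECT * FROM {table} ORDER BY {pk} LIMIT {CHUNK_SIZE}"
                )
            else:
                # Subsequent chunks: resume just past the last key seen.
                cur.execute(
                    f"SELECT * FROM {table} WHERE {pk} > %s "
                    f"ORDER BY {pk} LIMIT {CHUNK_SIZE}",
                    (last_pk,),
                )
            rows = cur.fetchall()
        if not rows:
            return
        last_pk = rows[-1][0]  # assumes the key is the first column
        yield rows
```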