
I'm using Debezium 0.7 to read from MySQL, but I'm getting flush timeout and OutOfMemoryError errors during the initial snapshot phase. Looking at the logs below, it seems the connector is trying to write too many messages in one go:

WorkerSourceTask{id=accounts-connector-0} flushing 143706 outstanding messages for offset commit   [org.apache.kafka.connect.runtime.WorkerSourceTask]
WorkerSourceTask{id=accounts-connector-0} Committing offsets   [org.apache.kafka.connect.runtime.WorkerSourceTask]
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
WorkerSourceTask{id=accounts-connector-0} Failed to flush, timed out while waiting for producer to flush outstanding 143706 messages   [org.apache.kafka.connect.runtime.WorkerSourceTask]

I wonder what the correct settings (http://debezium.io/docs/connectors/mysql/#connector-properties) are for sizeable databases (>50GB). I didn't have this issue with smaller databases. Simply increasing the timeout doesn't seem like a good strategy. I'm currently using the default connector settings.

Update

I changed the settings as suggested in the answer below and it fixed the problem:

OFFSET_FLUSH_TIMEOUT_MS: 60000  # default 5000
OFFSET_FLUSH_INTERVAL_MS: 15000  # default 60000
MAX_BATCH_SIZE: 32768  # default 2048
MAX_QUEUE_SIZE: 131072  # default 8192
HEAP_OPTS: '-Xms2g -Xmx2g'  # default '-Xms1g -Xmx1g'
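
For reference, a minimal sketch of where these environment variables would typically be set, assuming the debezium/connect Docker image run via docker-compose (service name, ports, and topic names are illustrative):

# Illustrative docker-compose service for the Kafka Connect worker.
# OFFSET_FLUSH_* and HEAP_OPTS are picked up by the debezium/connect image as
# worker/JVM settings; per Gunnar's comment under the first answer, max.batch.size
# and max.queue.size must instead go into the connector configuration.
connect:
  image: debezium/connect:0.7
  ports:
    - "8083:8083"
  environment:
    BOOTSTRAP_SERVERS: kafka:9092
    GROUP_ID: 1
    CONFIG_STORAGE_TOPIC: connect_configs
    OFFSET_STORAGE_TOPIC: connect_offsets
    OFFSET_FLUSH_TIMEOUT_MS: 60000
    OFFSET_FLUSH_INTERVAL_MS: 15000
    HEAP_OPTS: '-Xms2g -Xmx2g'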
Kamil Sindi

3 Answers


This is a very complex question. First of all, the default memory settings for the Debezium Docker images are quite low, so if you are using them it might be necessary to increase them.

Next, there are multiple factors at play. I recommend the following steps (a configuration sketch follows the list).

  1. Increase max.batch.size and max.queue.size - this reduces the number of commits
  2. Increase offset.flush.timeout.ms - this gives Connect time to process the accumulated records
  3. Decrease offset.flush.interval.ms - this should reduce the amount of accumulated offsets
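
As a rough sketch (the values here are only illustrative), steps 2 and 3 are Kafka Connect worker settings:

# Kafka Connect worker configuration, e.g. connect-distributed.properties
offset.flush.timeout.ms=60000
offset.flush.interval.ms=15000

while the options from step 1 belong in the "config" section of the connector JSON registered with the Connect REST API (the ... stands for the remaining connector options):

"config": {
    ...,
    "max.batch.size": "20480",
    "max.queue.size": "81920"
}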

Unfortunately, there is an issue, KAFKA-6551, lurking in the background that can still play havoc here.

Jiri Pechanec
  • Note that Debezium's options (`max.batch.size`, `max.queue.size`) must be given as connector options; specifying them as env variables won't work (that's only supported for the settings defined by Kafka Connect). – Gunnar Jul 11 '18 at 13:25
  • Hi, I am reading the documentation here https://debezium.io/documentation/reference/assemblies/cdc-mysql-connector/as_deploy-the-mysql-connector.html and I do not see `offset.flush.timeout.ms` or `offset.flush.interval.ms` config options. Where are you getting them from? – lollerskates Mar 10 '20 at 21:44
  • @lollerskates it is a connect worker config. have a look at this: https://docs.confluent.io/current/connect/references/allconfigs.html – sarah Jul 13 '20 at 19:43

To add onto what Jiri said:

There is now an open issue in the Debezium bug tracker; if you have any more information about root causes, logs, or reproduction, feel free to provide it there.

For me, changing the values that Jiri mentioned did not solve the issue. The only working workaround was to create multiple connectors on the same worker, each responsible for a subset of all tables. For this to work, you need to start connector 1, wait for its snapshot to complete, then start connector 2, and so on. In some cases, an earlier connector will fail to flush when a later connector starts to snapshot. In those cases, you can just restart the worker once all snapshots are completed and the connectors will pick up from the binlog again (make sure your snapshot mode is "when_needed"!).
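
A rough sketch of what such a split could look like, assuming the MySQL connector's `table.whitelist` option (connector name, credentials, and table lists are purely illustrative):

{
  "name": "accounts-connector-part1",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "...",
    "database.server.id": "1001",
    "database.server.name": "accounts1",
    "table.whitelist": "accounts.customers,accounts.orders",
    "snapshot.mode": "when_needed",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.accounts1"
  }
}

Each additional connector would then need its own name, database.server.id, database.server.name, and history topic, plus a different table.whitelist covering the remaining tables.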

swenzel
MrTrustworthy

I can confirm that the answer posted above by Jiri Pechanec solved my issues. These are the configurations I am using:

Kafka Connect worker configs, set in the worker.properties config file:

offset.flush.timeout.ms=60000
offset.flush.interval.ms=10000
max.request.size=10485760

Debezium connector configs, passed in the curl request used to register the connector:

max.queue.size = 81290
max.batch.size = 20480
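
For completeness, a sketch of what such a curl request to the Kafka Connect REST API might look like (the endpoint, connector name, and connection details are illustrative and not taken from the answer):

curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "accounts-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "...",
    "database.server.id": "1001",
    "database.server.name": "accounts",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.accounts",
    "max.queue.size": "81290",
    "max.batch.size": "20480"
  }
}'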

We didn't run into this issue with our staging MySQL db (~8GB) because the dataset is a lot smaller. For the production dataset (~80GB), we had to adjust these configurations.

Hope this helps.

mahdir24