
I think I might have missed some configuration. We're trying to use Debezium to snapshot all the rows from a table that has about 8 million records, but after some time the snapshot stops.

The connector config is:

{
   "connector.class":"io.debezium.connector.mysql.MySqlConnector",
   "database.user":"MyUser",
   "database.server.id":"12345",
   "tasks.max":"1",
   "database.history.kafka.bootstrap.servers":"MyKafka:9092",
   "database.history.kafka.topic":"MyConnectorHistory",
   "database.server.name":"MyDbName",
   "database.port":"3306",
   "table.whitelist":"BigTable",
   "decimal.handling.mode":"double",
   "database.hostname":"***",
   "database.password":"***",
   "name":"MyConnector",
   "database.whitelist":"MyDb",
   "snapshot.mode":"initial_only",
   "connect.timeout.ms":"60000"
}

The connector starts to scan the rows:

    April 24th 2019, 13:06:52.573   2019-04-24 16:06:52,569 INFO   MySQL|MyDbName|snapshot  Step 9: - 2040000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:59:29.129   [io.debezium.connector.mysql.SnapshotReader]
    ... other prints
    April 24th 2019, 12:17:28.448   2019-04-24 15:17:28,447 INFO   MySQL|MyDbName|snapshot  Step 9: - 50000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:10:05.008   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:43.183   2019-04-24 15:07:43,183 INFO   MySQL|MyDbName|snapshot  Step 9: - 40000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:00:19.744   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:36.499   2019-04-24 15:07:36,498 INFO   MySQL|MyDbName|snapshot  Step 9: - 30000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:00:13.059   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:30.157   2019-04-24 15:07:30,157 INFO   MySQL|MyDbName|snapshot  Step 9: - 20000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:00:06.718   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:25.116   2019-04-24 15:07:25,116 INFO   MySQL|MyDbName|snapshot  Step 9: - 10000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:00:01.677   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:23.439   2019-04-24 15:07:23,439 INFO   MySQL|MyDbName|snapshot  Step 9: - scanning table 'MyDb.BigTable' (1 of 10 tables)   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:23.427   2019-04-24 15:07:23,427 INFO   MySQL|MyDbName|snapshot  Step 8: tables were locked explicitly, but to get a consistent snapshot we cannot release the locks until we've read all tables.   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:23.427   2019-04-24 15:07:23,427 INFO   MySQL|MyDbName|snapshot  Step 9: scanning contents of 10 tables while still in transaction   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:23.143   2019-04-24 15:07:23,143 INFO   MySQL|MyDbName|snapshot  Step 7: generating DROP and CREATE statements to reflect current database schemas:   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:23.142   2019-04-24 15:07:23,142 INFO   MySQL|MyDbName|snapshot  Step 6: read binlog position of MySQL master   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:22.739   2019-04-24 15:07:22,739 INFO   MySQL|MyDbName|snapshot  Step 5: flush and obtain read lock for 10 tables (preventing writes)   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:22.635   2019-04-24 15:07:22,635 INFO   MySQL|MyDbName|snapshot  Step 4: read list of available tables in each database   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:22.633   2019-04-24 15:07:22,633 INFO   MySQL|MyDbName|snapshot  Step 3: read list of available databases   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:22.632   2019-04-24 15:07:22,632 INFO   MySQL|MyDbName|snapshot  Step 2: start transaction with consistent snapshot   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:22.632   2019-04-24 15:07:22,631 INFO   MySQL|MyDbName|snapshot  Step 1: unable to flush and acquire global read lock, will use table read locks after reading table names   [io.debezium.connector.mysql.SnapshotReader]
    April 24th 2019, 12:07:22.617   2019-04-24 15:07:22,617 INFO   MySQL|MyDbName|snapshot  Step 1: flush and obtain global read lock to prevent writes to database   [io.debezium.connector.mysql.SnapshotReader]

Then, after some time, we get:

    Failed to flush, timed out while waiting for producer to flush outstanding 4094 messages
    Failed to commit offsets   [org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter]

Then the scan stops and there are several more attempts to flush and commit the offsets:

  April 24th 2019, 12:34:08.641 2019-04-24 15:34:08,641 ERROR  ||  WorkerSourceTask{id=MyConnectorr-0} Failed to commit offsets   [org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter]
    April 24th 2019, 12:34:08.640   2019-04-24 15:34:08,640 ERROR  ||  WorkerSourceTask{id=MyConnectorr-0} Failed to flush, timed out while waiting for producer to flush outstanding 5560 messages   [org.apache.kafka.connect.runtime.WorkerSourceTask]
    April 24th 2019, 12:33:18.640   2019-04-24 15:33:18,640 INFO   ||  WorkerSourceTask{id=MyConnectorr-0} Committing offsets   [org.apache.kafka.connect.runtime.WorkerSourceTask]
    April 24th 2019, 12:33:18.640   2019-04-24 15:33:18,640 INFO   ||  WorkerSourceTask{id=MyConnectorr-0} flushing 5560 outstanding messages for offset commit   [org.apache.kafka.connect.runtime.WorkerSourceTask]
    April 24th 2019, 12:32:18.640   2019-04-24 15:32:18,640 ERROR  ||  WorkerSourceTask{id=MyConnectorr-0} Failed to commit offsets   [org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter]
    April 24th 2019, 12:32:18.639   2019-04-24 15:32:18,639 ERROR  ||  WorkerSourceTask{id=MyConnectorr-0} Failed to flush, timed out while waiting for producer to flush outstanding 5560 messages   [org.apache.kafka.connect.runtime.WorkerSourceTask]
    April 24th 2019, 12:31:28.639   2019-04-24 15:31:28,639 INFO   ||  WorkerSourceTask{id=MyConnectorr-0} Committing offsets   [org.apache.kafka.connect.runtime.WorkerSourceTask]
    April 24th 2019, 12:31:28.639   2019-04-24 15:31:28,639 INFO   ||  WorkerSourceTask{id=MyConnectorr-0} flushing 5560 outstanding messages for offset commit   [org.apache.kafka.connect.runtime.WorkerSourceTask]
    April 24th 2019, 12:30:28.639   2019-04-24 15:30:28,639 ERROR  ||  WorkerSourceTask{id=MyConnectorr-0} Failed to commit offsets   [org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter]
    April 24th 2019, 12:30:28.636   2019-04-24 15:30:28,635 ERROR  ||  WorkerSourceTask{id=MyConnectorr-0} Failed to flush, timed out while waiting for producer to flush outstanding 652 messages   [org.apache.kafka.connect.runtime.WorkerSourceTask]
    April 24th 2019, 12:29:38.635   2019-04-24 15:29:38,635 INFO   ||  WorkerSourceTask{id=MyConnectorr-0} flushing 5556 outstanding messages for offset commit   [org.apache.kafka.connect.runtime.WorkerSourceTask]
    April 24th 2019, 12:29:38.635   2019-04-24 15:29:38,635 INFO   ||  WorkerSourceTask{id=MyConnectorr-0} Committing offsets  

After some time (about 9-10 minutes) the flush appears to succeed and the connector starts scanning rows again. But after a while it fails once more and then, without finishing all the records, the connector's task changes its status to FAILED.

One of the errors was:

{
   "name":"MyConnector",
   "connector":{
      "state":"RUNNING",
      "worker_id":"svc.cluster.local:8083"
   },
   "tasks":[
      {
         "state":"FAILED",
         "trace":"org.apache.kafka.connect.errors.ConnectException: OffsetStorageWriter is already flushing\n\tat org.apache.kafka.connect.storage.OffsetStorageWriter.beginFlush(OffsetStorageWriter.java:110)\n\tat org.apache.kafka.connect.runtime.WorkerSourceTask.commitOffsets(WorkerSourceTask.java:318)\n\tat org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:197)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n",
         "id":0,
         "worker_id":"svc.cluster.local:8083"
      }
   ],
   "type":"source"
}

I read this issue: https://github.com/confluentinc/kafka-connect-jdbc/issues/161 and changed the parameter values as suggested. It's better, but it is still failing after some time. For now, my Connect config is:

OFFSET_FLUSH_INTERVAL_MS: 60000
OFFSET_FLUSH_TIMEOUT_MS: 50000
CONNECT_PRODUCER_BUFFER_MEMORY: 45554432
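
As far as I understand, those environment variables map to the following Kafka Connect worker properties (the mapping is my assumption; I have not verified it):

# assumed worker.properties equivalent of the environment variables above
offset.flush.interval.ms=60000
offset.flush.timeout.ms=50000
producer.buffer.memory=45554432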

I also tried the values described here: Debezium flush timeout and OutOfMemoryError errors with MySQL.

One thing that I haven't tried yet is the snapshot.select.statement.overrides parameter. But I'm not sure it will help, because the offset commit problem sometimes occurs after as few as 100k messages, so I would have to stop and resume the connector many times.
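
For reference, if I do try it, I believe the override would be added to the connector config roughly like this (the id column and the cutoff value are only placeholders for whatever key I would chunk the table on; this is an untested sketch):

   "snapshot.select.statement.overrides":"MyDb.BigTable",
   "snapshot.select.statement.overrides.MyDb.BigTable":"SELECT * FROM BigTable WHERE id > 4000000"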


1 Answer


I've used Debezium to snapshot a MySQL database that had multiple tables with over 30M records each. One of the tables had 100M+ records; for that one, I used the select statement override config (as it was an INSERT-only table).

Initially, using default settings to snapshot the database, I ran into the exact same issues as you are facing. I adjusted the following configurations and it helped solve my issue.

Kafka Connect worker configs, set in the worker.properties config file:

offset.flush.timeout.ms=60000
offset.flush.interval.ms=10000
max.request.size=10485760

Decreasing the offset flush interval allows Kafka Connect to flush the offsets more frequently, and setting a large timeout gives it more time to get an acknowledgement for the commit.

Debezium configs passed through the curl request to initialize it:

max.queue.size = 81290
max.batch.size = 20480

The default queue size is 8192, which is quite low for a larger database. Bumping these configurations up helped a lot.
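
To make this concrete, the request would look something like the sketch below. I am reusing the connector config from the question with the two properties added, so treat it as an illustration rather than my exact setup; the Connect REST host and port are placeholders:

# placeholder REST endpoint; connector config copied from the question plus max.queue.size and max.batch.size
curl -X PUT -H "Content-Type: application/json" http://localhost:8083/connectors/MyConnector/config -d '{
   "connector.class":"io.debezium.connector.mysql.MySqlConnector",
   "database.user":"MyUser",
   "database.server.id":"12345",
   "tasks.max":"1",
   "database.history.kafka.bootstrap.servers":"MyKafka:9092",
   "database.history.kafka.topic":"MyConnectorHistory",
   "database.server.name":"MyDbName",
   "database.port":"3306",
   "table.whitelist":"BigTable",
   "decimal.handling.mode":"double",
   "database.hostname":"***",
   "database.password":"***",
   "name":"MyConnector",
   "database.whitelist":"MyDb",
   "snapshot.mode":"initial_only",
   "connect.timeout.ms":"60000",
   "max.queue.size":"81290",
   "max.batch.size":"20480"
}'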

Hope this helps with your issue.

  • Out of curiosity, how long did it take you to snapshot those 30 million records? – Gunnar Sep 27 '19 at 09:14
  • The rate was around 1M records every 5 minutes on average. This varied depending on the table, but it was always between 4-6 minutes. Please note though that our Kafka cluster (4 nodes, 16GB RAM, 4 cores) has SASL/SSL enforced, so that slowed things down a decent bit. When I tested it out with a Kafka cluster of the same size but without SASL/SSL, I noticed it was doing about a million every 2 minutes or so. A couple of other things that affected speed were the producer.acks setting and the replication factor of those topics in Kafka. Snapshotting the db (800M+ records total) took 3 days. – mahdir24 Nov 04 '19 at 22:33
  • Hi @mahdir24, thanks for your information; that saved my time on Debezium as well. We are generating incremental logs from a large table as well (100M). However, it looks like our snapshot speed is only 20k rows per minute, which is quite slow. It is just a simple metadata table with 12 text, date, and string columns. Do you have any suggestions on this? Thanks in advance! – Martin Peng Mar 05 '21 at 07:33
  • I am using Kinesis with 4 shards and I am getting a speed of 10k records per 5 minutes. Any help on how to increase the speed? – Pavan Kumar Aryasomayajulu Mar 16 '21 at 11:37
  • Hi @MartinPeng Do you have visibility into the queue capacity used by the Kafka Connect process? I believe it is exposed via JMX metrics. Usually, if it is very slow, the bottleneck tends to be on the Kafka side. You can verify by checking whether your queue is full all the time while snapshotting. Some other things to look into: producing to multiple partitions for a topic, decreasing the replication factor of topics, and decreasing the producer acknowledgements from all to 1. However, be very careful with that last setting as it can lead to data loss. – mahdir24 Apr 22 '21 at 00:27
  • Just some additional feedback on this - we added the settings above to try and speed up capturing large tables, and while it did help, we started experiencing regular errors like this: `Invalid call to OffsetStorageWriter flush() while already flushing, the framework should not allow this` Setting the flush interval to be *longer* than the timeout has gotten rid of this error. – terryf82 Aug 09 '21 at 03:53
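
As a rough sketch of the adjustment described in the last comment above, the flush interval is set longer than the flush timeout (the exact values are illustrative only, not taken from that setup):

# worker.properties - illustrative values; the interval is intentionally longer than the timeout
offset.flush.interval.ms=60000
offset.flush.timeout.ms=30000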