I think I might have missed some configuration. We're trying to use Debezium to snapshot all the rows of a table with about 8 million records, and after some time the snapshot stops.
The connector config is:
{
  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
  "database.user": "MyUser",
  "database.server.id": "12345",
  "tasks.max": "1",
  "database.history.kafka.bootstrap.servers": "MyKafka:9092",
  "database.history.kafka.topic": "MyConnectorHistory",
  "database.server.name": "MyDbName",
  "database.port": "3306",
  "table.whitelist": "BigTable",
  "decimal.handling.mode": "double",
  "database.hostname": "***",
  "database.password": "***",
  "name": "MyConnector",
  "database.whitelist": "MyDb",
  "snapshot.mode": "initial_only",
  "connect.timeout.ms": "60000"
}
The connector starts scanning the rows (the log lines below are newest-first):
April 24th 2019, 13:06:52.573 2019-04-24 16:06:52,569 INFO MySQL|MyDbName|snapshot Step 9: - 2040000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:59:29.129 [io.debezium.connector.mysql.SnapshotReader]
... (other log lines omitted)
April 24th 2019, 12:17:28.448 2019-04-24 15:17:28,447 INFO MySQL|MyDbName|snapshot Step 9: - 50000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:10:05.008 [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:43.183 2019-04-24 15:07:43,183 INFO MySQL|MyDbName|snapshot Step 9: - 40000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:00:19.744 [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:36.499 2019-04-24 15:07:36,498 INFO MySQL|MyDbName|snapshot Step 9: - 30000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:00:13.059 [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:30.157 2019-04-24 15:07:30,157 INFO MySQL|MyDbName|snapshot Step 9: - 20000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:00:06.718 [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:25.116 2019-04-24 15:07:25,116 INFO MySQL|MyDbName|snapshot Step 9: - 10000 of 8609643 rows scanned from table 'MyDb.BigTable' after 00:00:01.677 [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:23.439 2019-04-24 15:07:23,439 INFO MySQL|MyDbName|snapshot Step 9: - scanning table 'MyDb.BigTable' (1 of 10 tables) [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:23.427 2019-04-24 15:07:23,427 INFO MySQL|MyDbName|snapshot Step 8: tables were locked explicitly, but to get a consistent snapshot we cannot release the locks until we've read all tables. [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:23.427 2019-04-24 15:07:23,427 INFO MySQL|MyDbName|snapshot Step 9: scanning contents of 10 tables while still in transaction [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:23.143 2019-04-24 15:07:23,143 INFO MySQL|MyDbName|snapshot Step 7: generating DROP and CREATE statements to reflect current database schemas: [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:23.142 2019-04-24 15:07:23,142 INFO MySQL|MyDbName|snapshot Step 6: read binlog position of MySQL master [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:22.739 2019-04-24 15:07:22,739 INFO MySQL|MyDbName|snapshot Step 5: flush and obtain read lock for 10 tables (preventing writes) [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:22.635 2019-04-24 15:07:22,635 INFO MySQL|MyDbName|snapshot Step 4: read list of available tables in each database [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:22.633 2019-04-24 15:07:22,633 INFO MySQL|MyDbName|snapshot Step 3: read list of available databases [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:22.632 2019-04-24 15:07:22,632 INFO MySQL|MyDbName|snapshot Step 2: start transaction with consistent snapshot [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:22.632 2019-04-24 15:07:22,631 INFO MySQL|MyDbName|snapshot Step 1: unable to flush and acquire global read lock, will use table read locks after reading table names [io.debezium.connector.mysql.SnapshotReader]
April 24th 2019, 12:07:22.617 2019-04-24 15:07:22,617 INFO MySQL|MyDbName|snapshot Step 1: flush and obtain global read lock to prevent writes to database [io.debezium.connector.mysql.SnapshotReader]
Then, after some time, we get:
Failed to flush, timed out while waiting for producer to flush outstanding 4094 messages
Failed to commit offsets [org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter]
Then the scan stops, and there are several repeated attempts to flush and commit the offsets:
April 24th 2019, 12:34:08.641 2019-04-24 15:34:08,641 ERROR || WorkerSourceTask{id=MyConnectorr-0} Failed to commit offsets [org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter]
April 24th 2019, 12:34:08.640 2019-04-24 15:34:08,640 ERROR || WorkerSourceTask{id=MyConnectorr-0} Failed to flush, timed out while waiting for producer to flush outstanding 5560 messages [org.apache.kafka.connect.runtime.WorkerSourceTask]
April 24th 2019, 12:33:18.640 2019-04-24 15:33:18,640 INFO || WorkerSourceTask{id=MyConnectorr-0} Committing offsets [org.apache.kafka.connect.runtime.WorkerSourceTask]
April 24th 2019, 12:33:18.640 2019-04-24 15:33:18,640 INFO || WorkerSourceTask{id=MyConnectorr-0} flushing 5560 outstanding messages for offset commit [org.apache.kafka.connect.runtime.WorkerSourceTask]
April 24th 2019, 12:32:18.640 2019-04-24 15:32:18,640 ERROR || WorkerSourceTask{id=MyConnectorr-0} Failed to commit offsets [org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter]
April 24th 2019, 12:32:18.639 2019-04-24 15:32:18,639 ERROR || WorkerSourceTask{id=MyConnectorr-0} Failed to flush, timed out while waiting for producer to flush outstanding 5560 messages [org.apache.kafka.connect.runtime.WorkerSourceTask]
April 24th 2019, 12:31:28.639 2019-04-24 15:31:28,639 INFO || WorkerSourceTask{id=MyConnectorr-0} Committing offsets [org.apache.kafka.connect.runtime.WorkerSourceTask]
April 24th 2019, 12:31:28.639 2019-04-24 15:31:28,639 INFO || WorkerSourceTask{id=MyConnectorr-0} flushing 5560 outstanding messages for offset commit [org.apache.kafka.connect.runtime.WorkerSourceTask]
April 24th 2019, 12:30:28.639 2019-04-24 15:30:28,639 ERROR || WorkerSourceTask{id=MyConnectorr-0} Failed to commit offsets [org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter]
April 24th 2019, 12:30:28.636 2019-04-24 15:30:28,635 ERROR || WorkerSourceTask{id=MyConnectorr-0} Failed to flush, timed out while waiting for producer to flush outstanding 652 messages [org.apache.kafka.connect.runtime.WorkerSourceTask]
April 24th 2019, 12:29:38.635 2019-04-24 15:29:38,635 INFO || WorkerSourceTask{id=MyConnectorr-0} flushing 5556 outstanding messages for offset commit [org.apache.kafka.connect.runtime.WorkerSourceTask]
April 24th 2019, 12:29:38.635 2019-04-24 15:29:38,635 INFO || WorkerSourceTask{id=MyConnectorr-0} Committing offsets
After some time (about 9 to 10 minutes) the flush appears to succeed and the row scan resumes. But after a while it fails again and, without finishing all the records, the task changes its status to FAILED. One of the errors was:
{
  "name": "MyConnector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "svc.cluster.local:8083"
  },
  "tasks": [
    {
      "state": "FAILED",
      "trace": "org.apache.kafka.connect.errors.ConnectException: OffsetStorageWriter is already flushing\n\tat org.apache.kafka.connect.storage.OffsetStorageWriter.beginFlush(OffsetStorageWriter.java:110)\n\tat org.apache.kafka.connect.runtime.WorkerSourceTask.commitOffsets(WorkerSourceTask.java:318)\n\tat org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:197)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n",
      "id": 0,
      "worker_id": "svc.cluster.local:8083"
    }
  ],
  "type": "source"
}
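As far as I understand, this exception means a scheduled offset commit began while the previous flush was still pending. When the task goes to FAILED I can restart it without recreating the connector, via Kafka Connect's standard REST API (host and port here are placeholders):

# restart only the failed task (id 0 in the status above)
curl -X POST http://localhost:8083/connectors/MyConnector/tasks/0/restart

But with snapshot.mode set to initial_only, an interrupted snapshot seems to start over from the beginning rather than resuming where it left off.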
I read this issue: https://github.com/confluentinc/kafka-connect-jdbc/issues/161 and tried changing the parameter values as suggested. It's better, but it is still failing after some time. For now, my Connect worker config is:
OFFSET_FLUSH_INTERVAL_MS: 60000
OFFSET_FLUSH_TIMEOUT_MS: 50000
CONNECT_PRODUCER_BUFFER_MEMORY: 45554432
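If I understand the container image's env-to-property mapping correctly, those variables correspond to the following Kafka Connect worker properties (the comments are my reading of each knob):

# assumed worker-property equivalents of the env vars above
offset.flush.interval.ms=60000   # how often the worker schedules a source offset commit
offset.flush.timeout.ms=50000    # how long to wait for the producer to flush outstanding messages before logging "Failed to flush"
producer.buffer.memory=45554432  # bytes available to the source task's producer for buffering records not yet sent to the brokers

The flush fails whenever the producer cannot drain its outstanding snapshot records within offset.flush.timeout.ms, which is why raising the timeout and shrinking the producer buffer both seem to help.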
I also tried the values described in "Debezium flush timeout and OutOfMemoryError errors with MySQL".
One thing that I haven't tried yet is the snapshot.select.statement.overrides parameter. But I'm not sure it will help, because the commit-offset problem sometimes occurs at only 100k messages, and I would have to stop and resume the connector many times.
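For reference, this is roughly how I understand that parameter would be used (the id column and the range are hypothetical; I'd have to adapt them to BigTable's actual primary key):

{
  "snapshot.select.statement.overrides": "MyDb.BigTable",
  "snapshot.select.statement.overrides.MyDb.BigTable": "SELECT * FROM MyDb.BigTable WHERE id BETWEEN 1 AND 1000000"
}

After each slice completed I would have to advance the WHERE range and re-run the snapshot, which is exactly the stop/resume churn I'd like to avoid.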