
I am using kafka-connect-jdbc-4.0.0.jar and postgresql-9.4-1206-jdbc41.jar.

This is the Kafka Connect source connector configuration:

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "mode": "timestamp",
  "timestamp.column.name": "updated_at",
  "topic.prefix": "streaming.data.v2",
  "connection.password": "password",
  "connection.user": "user",
  "schema.pattern": "test",
  "query": "select * from view_source",
  "connection.url": "jdbc:postgresql://host:5432/test?currentSchema=test"
}

I have configured two connectors, one source and one sink, using the JDBC driver against a PostgreSQL database ("PostgreSQL 9.6.9"), and everything works correctly.

I have doubts about how the connector collects the source data. Looking at the log, I see a difference of 21 seconds between the executions of the queries:

[2019-01-11 08:20:18,985] DEBUG Checking for next block of results from TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} (io.confluent.connect.jdbc.source.JdbcSourceTask)
[2019-01-11 08:20:18,985] DEBUG TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} prepared SQL query: select * from view_source WHERE "updated_at" > ? AND "updated_at" < ? ORDER BY "updated_at" ASC (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier)
[2019-01-11 08:20:18,985] DEBUG executing query select CURRENT_TIMESTAMP; to get current time from database (io.confluent.connect.jdbc.util.JdbcUtils)
[2019-01-11 08:20:18,985] DEBUG Executing prepared statement with timestamp value = 2019-01-11 08:17:07.000 end time = 2019-01-11 08:20:18.985 (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier)
[2019-01-11 08:20:19,070] DEBUG Resetting querier TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} (io.confluent.connect.jdbc.source.JdbcSourceTask)

[2019-01-11 08:20:49,499] DEBUG Checking for next block of results from TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} (io.confluent.connect.jdbc.source.JdbcSourceTask)
[2019-01-11 08:20:49,500] DEBUG TimestampIncrementingTableQuerier{name='null', query='select * from view_source', topicPrefix='streaming.data.v2', timestampColumn='updated_at', incrementingColumn='null'} prepared SQL query: select * from view_source WHERE "updated_at" > ? AND "updated_at" < ? ORDER BY "updated_at" ASC (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier)
[2019-01-11 08:20:49,500] DEBUG executing query select CURRENT_TIMESTAMP; to get current time from database (io.confluent.connect.jdbc.util.JdbcUtils)
[2019-01-11 08:20:49,500] DEBUG Executing prepared statement with timestamp value = 2019-01-11 08:20:39.000 end time = 2019-01-11 08:20:49.500 (io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier)

The first query collects data between 08:17:07.000 and 08:20:18.985, but the second collects data between 08:20:39.000 and 08:20:49.500. Between the two windows there is a gap of 21 seconds in which there may be records...

[2019-01-11 08:20:18,985] DEBUG Executing prepared statement with timestamp value = 2019-01-11 08:17:07.000 end time = 2019-01-11 08:20:18.985
[2019-01-11 08:20:49,500] DEBUG Executing prepared statement with timestamp value = 2019-01-11 08:20:39.000 end time = 2019-01-11 08:20:49.500
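
With the bound parameter values from the log filled in, the two effective statements look like this (a reconstruction from the log above, with the timestamp literals copied in; the comments mark the gap):

-- first poll
select * from view_source
WHERE "updated_at" > '2019-01-11 08:17:07.000'  -- lower bound: stored offset
  AND "updated_at" < '2019-01-11 08:20:18.985'  -- upper bound: CURRENT_TIMESTAMP from the database
ORDER BY "updated_at" ASC;

-- second poll
select * from view_source
WHERE "updated_at" > '2019-01-11 08:20:39.000'
  AND "updated_at" < '2019-01-11 08:20:49.500'
ORDER BY "updated_at" ASC;

-- any row whose updated_at falls in (08:20:18.985, 08:20:39.000] matches neither window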

I assume that one of the values is the timestamp of the last record obtained, and the other is the current timestamp at the moment of the query.

I cannot find an explanation for this. Is this the normal operation of the connector? Should I assume that it will not always collect all the information?


1 Answer


The JDBC connector is not guaranteed to retrieve every message. For that, you need log-based Change Data Capture. For Postgres that is provided by Debezium and Kafka Connect. You can read more about this here.

Disclaimer: I work for Confluent, and wrote the above blog

Edit: There is now a recording available of the above blog too, from ApacheCon 2020: https://rmoff.dev/no-more-silos

– Robin Moffatt
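
For reference, a minimal sketch of what an equivalent Debezium Postgres source connector configuration could look like (the host, credentials, and plugin are placeholders, and exact property names depend on the Debezium version; PostgreSQL 9.6 also needs a logical decoding output plugin such as wal2json installed on the server):

{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "host",
  "database.port": "5432",
  "database.user": "user",
  "database.password": "password",
  "database.dbname": "test",
  "database.server.name": "streaming.data.v2",
  "plugin.name": "wal2json"
}

Note that log-based CDC captures changes to tables from the write-ahead log, so it would stream the tables underlying view_source rather than the view itself.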
  • Thank you very much for your response, I value it positively. However, why does the connector not guarantee the delivery of every row? Not even using incrementing mode, or timestamp + incrementing? Should I definitely use Debezium? – GeoPamplona Jan 14 '19 at 15:41
  • That's an implicit aspect of a polling-based approach: you cannot prevent more than one update happening between two polling attempts, in which case the first update won't be captured. Log-based CDC as done by Debezium avoids this, as it obtains all the changes from the DB's append-only log files (disclaimer: I'm contributing to Debezium). – Gunnar Feb 13 '19 at 12:44
  • @Gunnar If I'm using "Incrementing Column" mode only with a JDBC Connector and I'm polling every 1 hour, for example, is there a reason to believe that I will ever miss any new records? – paiego Jun 02 '20 at 16:59
  • (1/2) I agree this is a great question. For example, if the connector runs at 13:30:01 but the most recent row in the table has time 13:29:40, I think you should be fine, as the subsequent run will query `WHERE my_timestamp_col > '...13:29:40'`. Right? – Patrick Szalapski Nov 13 '20 at 16:00
  • (2/2) However, if the connector runs at 14:30:01 and the most recent row in the table has timestamp 14:30:01, and then after the connector runs a new row is written, also with timestamp 14:30:01, this new row will be missed in the 14:30:01 run, because it didn't exist until a fraction of a second later, and then it will be skipped in the subsequent run, because this later run will have `WHERE my_timestamp_col > '14:30:01'` on it, right? – Patrick Szalapski Nov 13 '20 at 16:03 (see the sketch after these comments)
  • @PatrickSzalapski Why doesn't the Kafka JDBC connector then use WHERE my_timestamp_col >= '14:30:01' to also capture the events which appeared at 14:30:01 in between? – MetallicPriest Apr 15 '21 at 08:02
  • Because the connector supplies the WHERE clause and uses greater than, not greater than or equal. Why? Maybe Robin Moffatt can answer. – Patrick Szalapski Apr 15 '21 at 11:20
  • True, but at least there should be an option for greater than or equal. In that case, you may get duplicates, but at least you won't miss anything. – MetallicPriest Apr 15 '21 at 18:04
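
To make the same-timestamp scenario from the comments concrete, here is a hypothetical sequence (source_table and my_timestamp_col are invented names for illustration):

-- 1. A poll runs at 14:30:01; the newest row it sees is stamped 14:30:01,
--    so the connector stores '14:30:01' as its offset.
-- 2. A moment later another row is committed, also stamped 14:30:01.
-- 3. Every later poll uses a strict comparison, so that row never matches:
SELECT * FROM source_table
WHERE my_timestamp_col > '2020-11-13 14:30:01'
ORDER BY my_timestamp_col ASC;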