
I want to capture data changes from a few tables in a huge PostgreSQL database. Initially I planned to use the logical decoding feature with Debezium, but this solution has significant overhead since it's necessary to decode the entire WAL. Another solution uses triggers and PgQ. Is there any general way to integrate PgQ with Kafka, or perhaps a Kafka connector for this purpose?

Iskuskov Alexander

1 Answer


You either go transaction log, or you go query-based.

Which one you use depends on how you need to consume the data: query-based polls the DB, log-based reads the transaction log (the WAL).

I'm interested in your assertion that Debezium has "significant overhead"—have you quantified this? I know there are lots of people using it and it's not usually raised as an issue.

For query-based capture use the Kafka Connect JDBC source connector.
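
For illustration, here is a minimal sketch of a JDBC source connector configuration in standalone properties form; the connection details, table list, and the updated_at/id column names are placeholder assumptions, not values from the question:

```properties
# Sketch of a Kafka Connect JDBC source connector (query-based capture).
# Connection details, table list, and the updated_at/id columns are placeholders.
name=jdbc-source-postgres
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://db-host:5432/mydb
connection.user=kafka_connect
connection.password=********
# timestamp+incrementing detects both new rows and updates to existing rows
mode=timestamp+incrementing
timestamp.column.name=updated_at
incrementing.column.name=id
# only poll the tables you actually need
table.whitelist=orders,customers
topic.prefix=pg-
poll.interval.ms=5000
```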

You can see pros and cons of each approach here: http://rmoff.dev/ksny19-no-more-silos

Robin Moffatt
  • Robin, thanks a lot for your answer! My assertion about overhead is about the database and its logical decoding mechanics, not Debezium itself. For example, there are 1k tables in the DB, but I need only 10. In that case it's necessary to decode the WAL for all 1k tables, since the WAL is just a stream of bytes and cannot be filtered without decoding. And, as I understand it, the decoding process is not free (e.g. CPU cost), while the useful work is very low (~10/1000 = 1%). – Iskuskov Alexander Nov 26 '19 at 09:47
  • What you say makes sense. I guess if you want the data, there's going to be an overhead at some point. You've got an overhead of triggers, you've got an overhead of poll-based querying, you've got the overhead of log-reading. You have to pick the one most acceptable :) – Robin Moffatt Nov 26 '19 at 09:56
  • Yes, I agree :) I think that, in terms of efficiency, the trigger-based approach is best. In that case, is it OK to write your own source connector for PgQ, or is it simpler to use a plain producer? – Iskuskov Alexander Nov 26 '19 at 11:15
  • Let me challenge you on triggers being the most efficient :) The main problem is that they have to run as part of your write transactions, i.e. they add latency to your online queries. WAL reading, on the contrary, happens asynchronously and is generally very low overhead. Think of it as just another replication client: would you consider enabling replication a significant overhead? Sure, it might add a few CPU cycles, but it's async and out of the way of your transactions. With wal2json and pgoutput you can also apply table filters right on the database side, so only relevant changes get streamed to Kafka Connect. – Gunnar Dec 03 '19 at 08:51
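
To illustrate Gunnar's point, a sketch of a Debezium PostgreSQL source connector using the pgoutput plugin and restricted to specific tables; the hostname, credentials, and table names are placeholders, and newer Debezium releases spell table.whitelist as table.include.list:

```properties
# Sketch of a Debezium PostgreSQL source connector (log-based capture).
# Hostname, credentials, and table names are placeholders; newer Debezium
# releases rename table.whitelist to table.include.list.
name=dbz-postgres-source
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=db-host
database.port=5432
database.user=debezium
database.password=********
database.dbname=mydb
database.server.name=pg
# pgoutput is the logical decoding output plugin built into PostgreSQL 10+
plugin.name=pgoutput
# restrict capture to the tables you care about
table.whitelist=public.orders,public.customers
```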