
So I have a Confluent Kafka JDBC connector set up. First I start up a Schema Registry like so

./bin/schema-registry-start ./etc/schema-registry/schema-registry.properties

This is the schema-registry.properties file

listeners=http://0.0.0.0:8081
kafkastore.connection.url=zookeeperhost:2181
kafkastore.bootstrap.servers=PLAINTEXT://kafkahost:9092
kafkastore.topic=_schemas
debug=false

Next I start up a standalone connector like this

./bin/connect-standalone ./etc/schema-registry/connect-avro-standalone.properties ./jdbc-source.properties

connect-avro-standalone.properties is

bootstrap.servers=kafkahost:9092

key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081

internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false

offset.storage.file.filename=/tmp/connect.offsets
plugin.path=share/java

jdbc-source.properties is

name=jdbc_source_oracle
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=host)(PORT=port))(CONNECT_DATA=(SERVER=dedicated)(SID=server)))
connection.user=xxx
connection.password=xxx
table.whitelist=table1, table2
mode=bulk
topic.prefix=my_topic
query=select * from table1 t1 join table1 t2 on t2.id = t1.id where t2.entereddate >='19-FEB-2019' and t2.entereddate <= '23-FEB-2019'

The query I am using is only for testing purposes; the real query I want to use will implement the incrementing mode and will contain no WHERE clause.
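For illustration, here is a rough sketch of what I think the incrementing-mode version of jdbc-source.properties would look like (the incrementing.column.name value id is just a placeholder for whatever key column I end up using):

name=jdbc_source_oracle_incremental
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=host)(PORT=port))(CONNECT_DATA=(SERVER=dedicated)(SID=server)))
connection.user=xxx
connection.password=xxx
table.whitelist=table1, table2
mode=incrementing
incrementing.column.name=id
topic.prefix=my_topic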

Now this manages to publish data into the topic, but with some weird stuff going on. First, the IDs are saved in an unreadable format, just an empty square. Second, some fields that are populated in the database are saved as null in the topic. And third, whenever I change the date range in the query in the JDBC source file, nothing happens: the topic still contains the same messages published the first time it ran, no matter how many times I change the query.

Can anyone help me?

EDIT

What I want to do is consume the data through PySpark. Here's the code I am using to do it

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("data streaming app")\
    .getOrCreate()


data_raw = spark.readStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", "kafkahost:9092")\
    .option("subscribe", "my_topic")\
    .load()

query = data_raw.writeStream\
    .outputMode("append")\
    .format("console")\
    .option("truncate", "false")\
    .trigger(processingTime="5 seconds")\
    .start()\
    .awaitTermination()

I've also consumed the data using the kafka-avro-console-consumer with this command

./bin/kafka-avro-console-consumer \
--bootstrap-server kafkahost:9092 \
--property print.key=true \
--from-beginning \
--topic my_topic

Both of these give me weird results.

Here's what the PySpark code is giving me: [screenshot]

and this is what the Avro console consumer is giving me: [screenshot]

I'm blocking out some fields and text as they may contain company-sensitive information.

anonuser1234
  • Not clear how you are consuming the data. You need to use `kafka-avro-console-consumer`. Also, I believe timestamp incrementing mode needs at least second time precision, not string date columns – OneCricketeer Mar 05 '19 at 01:22
  • BTW if you're using `query` you don't need `table.whitelist`. – Robin Moffatt Mar 05 '19 at 06:53
  • Also see https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector – Robin Moffatt Mar 05 '19 at 06:54
  • @cricket_007, I updated my post with information on how I am consuming from the topic. Also, I am not using incrementing mode in this case; I am using bulk mode so that I can test out the functionality, which is why I am using string date columns. Does this matter? – anonuser1234 Mar 05 '19 at 14:48
  • @RobinMoffatt I tried not using table.whitelist before but I got this exception `java.sql.SQLException: Invalid column type: getTimestamp not implemented for class oracle.jdbc.driver.T4CClobAccessor` I also looked through that link but cannot find anything that could help me – anonuser1234 Mar 05 '19 at 14:51
  • I've answered the points that I can, but without specific data and schema information it's hard to give much more detail around the issue. – Robin Moffatt Mar 05 '19 at 15:54

1 Answer


If you're consuming Avro from Spark, you'll need to use the correct deserializer.

If you're seeing bytes in your Avro data from the console, then it's down to the handling of decimals/numerics, as detailed here.
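As one concrete example of the kind of fix that article covers (assuming your connector version supports the numeric.mapping option), you can ask the JDBC source to map Oracle NUMBER columns to plain numeric types rather than Avro decimal bytes:

# added to jdbc-source.properties; best_fit picks INT/BIGINT/DOUBLE etc. based on each column's precision and scale
numeric.mapping=best_fit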

You can read more about Kafka Connect and serialisation alternatives to Avro (including JSON) here.
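For completeness, here is a rough sketch of consuming the topic from plain Python with the Confluent client's Avro support instead of Spark (the group.id is arbitrary, and the Schema Registry URL is assumed to be the one from your worker config):

from confluent_kafka.avro import AvroConsumer

# group.id is arbitrary; schema.registry.url matches connect-avro-standalone.properties
consumer = AvroConsumer({
    'bootstrap.servers': 'kafkahost:9092',
    'group.id': 'my_topic_reader',
    'schema.registry.url': 'http://localhost:8081',
    'auto.offset.reset': 'earliest'
})
consumer.subscribe(['my_topic'])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("Consumer error: {}".format(msg.error()))
            continue
        # msg.value() has already been deserialised from Avro into a Python dict
        print(msg.key(), msg.value())
finally:
    consumer.close()

This sidesteps the Spark deserialisation question entirely, at the cost of not being a Spark job.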

Robin Moffatt
  • This is for Scala, I'm not sure if it will work for Python. Is there a way for me to not use Avro when using the Kafka Connector? Or a schema in general? – anonuser1234 Mar 05 '19 at 17:01
  • Yes, you can use JsonConverter (or even StringConverter, although why you'd want to do that is beyond me). I've added a link to my answer that will help explain this – Robin Moffatt Mar 05 '19 at 17:27
  • Note that you *can* consume Avro-serialised messages with Python: https://docs.confluent.io/5.0.0/clients/confluent-kafka-python/index.html#avro / https://github.com/confluentinc/confluent-kafka-python – Robin Moffatt Mar 05 '19 at 17:30
  • @anonuser1234 You might want to try this for Spark output to the console https://github.com/AbsaOSS/ABRiS/ – OneCricketeer Mar 06 '19 at 19:26
  • @cricket_007, thanks but that looks like its only for scala. I am working with PySpark – anonuser1234 Mar 06 '19 at 21:27
  • @anonuser1234 It's possible to import any Scala/Java Spark packages into PySpark. Just requires more configuration in the code. For example, the Kafka Streaming package is written in Scala, and you are using it in PySpark – OneCricketeer Mar 08 '19 at 16:46
  • @cricket_007 Is there an example of this somewhere I can see? – anonuser1234 Mar 08 '19 at 22:21