
I'm trying to use Kafka connect sink to write files from Kafka to HDFS.

My properties look like:

connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
flush.size=3
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
schema.compatability=BACKWARD
key.converter.schemas.enabled=false
value.converter.schemas.enabled=false
schemas.enable=false

And when I try to run the connector, I get the following exception:

org.apache.kafka.connect.errors.DataException: JsonConverter with schemas.enable requires "schema" and "payload" fields and may not contain additional fields. If you are trying to deserialize plain JSON data, set schemas.enable=false in your converter configuration.

I'm using Confluent version 4.0.0.

Any suggestions please?

  • https://stackoverflow.com/a/45940013/2308683 – OneCricketeer Aug 14 '18 at 14:10
  • 1
    @cricket_007, If my Json isn't with "schema" and "payload", How can I write Parquet file anyway? – Ya Ko Aug 14 '18 at 14:39
  • 1
    I don't think you can. Parquet requires a Schema and last time I checked, the Kafka Connect code from Confluent uses the Avro libraries to help convert the Kafka message into Parquet files – OneCricketeer Aug 14 '18 at 14:41
  • Ok. So I need to use key/value conveter as Avro and format.class as Parquet? Where should I need to configure the Avro schema? – Ya Ko Aug 14 '18 at 14:46
  • 1
    You would need to produce Avro into the topic to begin with, using the Schema Registry. Otherwise, you must add the schema field to the JSON message. Alternatively, use JSONFormat rather than Parquet, then use Hive, Spark, whatever to convert to Parquet later. In any option you choose, the schema needs defined, but that's not a property that is added in the Connect framework – OneCricketeer Aug 14 '18 at 14:50
  • So just to understand - If I want to use key/value Avro convertors I need to produce Avro data in my Kafka topic? And If I want to convert JSON to parquet I must JSON with schema and payload format? I cant read json as string and provide schema for it and then convert it to parquet? – Ya Ko Aug 14 '18 at 14:56
  • 1
    Sounds like you got it. There's more options like using Kafka Streams or KSQL to convert the JSON topic into an Avro topic, and then using Connect, but that assumes that you cannot change the producer code and are able to reliably deploy those services – OneCricketeer Aug 14 '18 at 15:02
  • @cricket_007, That's right, If I can change the producer code to make JSON to be with "schema", "payload" its should work with parquet right? Thank you very much!! – Ya Ko Aug 15 '18 at 06:14
  • 1
    I haven't tried it, but that's what the error is trying to tell you – OneCricketeer Aug 15 '18 at 14:31
  • https://rmoff.net/2017/09/06/kafka-connect-jsondeserializer-with-schemas.enable-requires-schema-and-payload-fields/ – Sajjan Kumar Mar 10 '23 at 10:20
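
For reference, the "schema" and "payload" envelope mentioned in the error and in the comments above looks roughly like this when schemas.enable=true; the record fields (name, age) are made up purely for illustration:

{
  "schema": {
    "type": "struct",
    "name": "example.Record",
    "optional": false,
    "fields": [
      { "field": "name", "type": "string", "optional": false },
      { "field": "age",  "type": "int32",  "optional": true }
    ]
  },
  "payload": {
    "name": "alice",
    "age": 30
  }
}

Every record produced to the topic would need to carry this envelope for JsonConverter to hand Connect a record with a schema that ParquetFormat can write.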

1 Answer


My understanding of this issue is that when you set schemas.enable=true, you tell the converter that the schema should be included in the messages that Kafka transfers. In that case a message is not plain JSON: it first describes the schema and then attaches the payload (i.e., the actual data) that corresponds to that schema (read about Avro formatting for the same idea). This leads to the conflict: on the one hand you've specified JsonConverter for your data, and on the other hand you ask for the schema to be included in the messages. To fix this, you can either use AvroConverter with schemas.enable=true or JsonConverter with schemas.enable=false.

D.M.
  • I don't think AvroConverter cares about `schemas.enable` setting because it always assumes the schema is available and in the schema registry – OneCricketeer Aug 14 '18 at 14:08
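
Pulling the answer and the comments together, a minimal connector sketch for the Avro route would look something like this (the Schema Registry URL is a placeholder, and it assumes the producers are switched to Avro; note also that the JsonConverter flag is spelled schemas.enable and is prefixed per converter, so the schemas.enabled keys in the question were likely being ignored, leaving the default of true):

connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
flush.size=3
# AvroConverter always fetches the schema from the Schema Registry
# (it does not use schemas.enable, as the comment above points out)
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081

# To keep plain JSON instead, the correctly spelled flags would be:
# key.converter.schemas.enable=false
# value.converter.schemas.enable=false
# but then the records carry no schema, so ParquetFormat cannot be used.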