
I am trying to use the Confluent Kafka S3 sink connector with Confluent 4.1.1.

s3-sink

"value.converter.schema.registry.url": "http://localhost:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter"

When I run the Kafka Connect S3 sink connector, I get this error message:

ERROR WorkerSinkTask{id=singular-s3-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
org.apache.kafka.connect.errors.DataException: Invalid JSON for array default value: "null"
        at io.confluent.connect.avro.AvroData.defaultValueFromAvro(AvroData.java:1649)
        at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1562)
        at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1443)
        at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1443)
        at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1323)
        at io.confluent.connect.avro.AvroData.toConnectData(AvroData.java:1047)
        at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:87)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:468)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:301)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:205)
        at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:173)
        at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
        at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

My schema contains only one array-type field, and its definition looks like this:

{"name":"item_id","type":{"type":"array","items":["null","string"]},"default":[]}

I am able to see the deserialized messages using the kafka-avro-console-consumer command. I have seen a similar question, but in that case the author was using the Avro serializer for the key as well.

./confluent-4.1.1/bin/kafka-avro-console-consumer --topic singular_custom_postback --bootstrap-server localhost:9092 --max-messages 2

"item_id":[{"string":"15552"},{"string":"37810"},{"string":"38061"}]
"item_id":[]

I cannot paste the entire output from the console consumer because it contains sensitive user information, so I have only included the array-type field from my schema.

Thanks in advance.

Ayush Chauhan
2 Answers


Same problem as the question that you have linked to.

In the AvroData source code, you can see this check:

  case ARRAY: {
    if (!jsonValue.isArray()) {
      throw new DataException("Invalid JSON for array default value: " + jsonValue.toString());
    }

The exception can be thrown when the schema declares "type":"array" but the payload itself contains a null value (or any other non-array type) rather than an actual array, despite whatever you have defined as your schema default value. The default is only applied when the items element is missing entirely, not when it is "items": null.
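
Concretely, that isArray() check only accepts a default that is itself a JSON array. As an illustration (these defaults are examples, not taken from your registered schema):

  "default": []        // passes the isArray() check
  "default": "null"    // fails; Jackson prints this text node as "null", quotes included

The quoted "null" in your error message therefore suggests that the default actually registered for that field is the string "null" rather than an empty array.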


Other than that, I would suggest a schema like the following, i.e. a record object rather than just a named array, with a default of an empty array, not null.

{
  "type" : "record",
  "name" : "Items",
  "namespace" : "com.example.avro",
  "fields" : [ {
    "name" : "item_id",
    "type" : {
      "type" : "array",
      "items" : [ "null", "string" ]
    },
    "default": []
  } ]
}
OneCricketeer
  • But we are making sure to send an empty array when the value of items is null, so "items": null can never happen – Ayush Chauhan Oct 31 '18 at 09:31
  • @cricket_007 I am having the same issue and will investigate as well next week. But does anyone know why the kafka-avro-console-consumer manages to deserialise what the s3 sink could not? Avro is either valid or not. Am I missing something? Thanks! – Vassilis Apr 05 '19 at 16:20
  • @Vassilis The error is thrown in the Connect API, which the console consumer does not use – OneCricketeer Apr 23 '19 at 20:38

io.confluent.connect.avro.AvroData.defaultValueFromAvro(AvroData.java:1649) is called to convert the Avro schema of the message you read into the Connect sink's internal schema. I believe it is not related to the data in your message. That is why the AbstractKafkaAvroDeserializer can successfully deserialise your message (e.g. via kafka-avro-console-consumer): your message is valid Avro. The above exception may occur if your default value is null while null is not a valid value for your field, e.g.

{
   "name":"item_id",
   "type":{
      "type":"array",
      "items":[
         "string"
      ]
   },
   "default": null
}
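
For comparison, a default that is consistent with the declared array type would look like this (illustrative only, not necessarily your registered schema):

{
   "name":"item_id",
   "type":{
      "type":"array",
      "items":[
         "string"
      ]
   },
   "default":[]
}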

I would suggest remotely debugging Connect to see exactly what is failing.
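
If remote debugging is inconvenient, a small standalone sketch like the one below can reproduce the same conversion path locally. The class name and schema string are placeholders for illustration, and it assumes the avro and kafka-connect-avro-converter jars are on the classpath; paste in the schema exactly as the Schema Registry returns it.

import org.apache.avro.Schema;
import io.confluent.connect.avro.AvroData;

public class AvroDataDebug {
  public static void main(String[] args) {
    // Placeholder record; replace with the schema returned by the Schema Registry.
    String registeredSchema =
        "{\"type\":\"record\",\"name\":\"Postback\",\"fields\":["
      + "{\"name\":\"item_id\",\"type\":{\"type\":\"array\",\"items\":[\"null\",\"string\"]},\"default\":[]}"
      + "]}";
    Schema avroSchema = new Schema.Parser().parse(registeredSchema);
    // AvroData and toConnectSchema() are the class and method from the stack trace,
    // so a problematic default should throw the same DataException here.
    AvroData avroData = new AvroData(1);
    System.out.println(avroData.toConnectSchema(avroSchema));
  }
}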

Vassilis