
I have a Kafka Connect flow of MongoDB -> Kafka Connect -> Elasticsearch sending data end to end OK, but the payload document ends up as a JSON-encoded string instead of a structured document. Here's my source MongoDB document.

{
  "_id": "1541527535911",
  "enabled": true,
  "price": 15.99,
  "style": {
    "color": "blue"
  },
  "tags": [
    "shirt",
    "summer"
  ]
}

And here's my MongoDB source connector configuration:

{
  "name": "redacted",
  "config": {
    "connector.class": "com.teambition.kafka.connect.mongo.source.MongoSourceConnector",
    "databases": "redacted.redacted",
    "initial.import": "true",
    "topic.prefix": "redacted",
    "tasks.max": "8",
    "batch.size": "1",
    "key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer": "org.apache.kafka.common.serialization.JSONSerializer",
    "key.serializer.schemas.enable": false,
    "value.serializer.schemas.enable": false,
    "compression.type": "none",
    "mongo.uri": "mongodb://redacted:27017/redacted",
    "analyze.schema": false,
    "schema.name": "__unused__",
    "transforms": "RenameTopic",
    "transforms.RenameTopic.type":
      "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.RenameTopic.regex": "redacted.redacted_Redacted",
    "transforms.RenameTopic.replacement": "redacted"
  }
}

Over in Elasticsearch, it ends up looking like this:

{
  "_index" : "redacted",
  "_type" : "kafka-connect",
  "_id" : "{\"schema\":{\"type\":\"string\",\"optional\":true},\"payload\":\"1541527535911\"}",
  "_score" : 1.0,
  "_source" : {
    "ts" : 1541527536,
    "inc" : 2,
    "id" : "1541527535911",
    "database" : "redacted",
    "op" : "i",
    "object" : "{ \"_id\" : \"1541527535911\", \"price\" : 15.99,
      \"enabled\" : true, \"tags\" : [\"shirt\", \"summer\"],
      \"style\" : { \"color\" : \"blue\" } }"
  }
}

I'd like to use 2 single message transforms:

  1. ExtractField to grab object, which is a string of JSON
  2. Something to parse that JSON into an object, or just let the normal JsonConverter handle it, as long as it ends up properly structured in Elasticsearch.
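
To illustrate, after step 1 the record value would still be one big JSON-encoded string (essentially the contents of the object field in the Elasticsearch result above):

"{ \"_id\" : \"1541527535911\", \"price\" : 15.99, \"enabled\" : true, \"tags\" : [\"shirt\", \"summer\"], \"style\" : { \"color\" : \"blue\" } }"

and I'd like Elasticsearch to index the parsed document instead.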

I've attempted to do it with just ExtractField in my sink config, but I see this error logged by Kafka Connect:

kafka-connect_1       | org.apache.kafka.connect.errors.ConnectException:
Bulk request failed: [{"type":"mapper_parsing_exception",
"reason":"failed to parse", 
"caused_by":{"type":"not_x_content_exception",
"reason":"Compressor detection can only be called on some xcontent bytes or
compressed xcontent bytes"}}]

Here's my Elasticsearch sink connector configuration. In this version I have things working, but I had to write a custom ParseJson SMT. It works well, but if there's a better way, or a way to do this with some combination of built-in pieces (converters, SMTs, whatever works), I'd love to see it.

{
  "name": "redacted",
  "config": {
    "connector.class":
      "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "batch.size": 1,
    "connection.url": "http://redacted:9200",
    "key.converter.schemas.enable": true,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "schema.ignore": true,
    "tasks.max": "1",
    "topics": "redacted",
    "transforms": "ExtractFieldPayload,ExtractFieldObject,ParseJson,ReplaceId",
    "transforms.ExtractFieldPayload.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractFieldPayload.field": "payload",
    "transforms.ExtractFieldObject.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractFieldObject.field": "object",
    "transforms.ParseJson.type": "reaction.kafka.connect.transforms.ParseJson",
    "transforms.ReplaceId.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceId.renames": "_id:id",
    "type.name": "kafka-connect",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": false
  }
}
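
For anyone curious about the ParseJson SMT: the general idea is roughly the following (a simplified sketch of the approach rather than the exact code; the package name is made up, and it just reuses Connect's own JsonConverter in schemaless mode to do the parsing):

package example.transforms; // hypothetical package name

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.json.JsonConverter;
import org.apache.kafka.connect.transforms.Transformation;

// Parses a record whose value is a JSON-encoded String into schemaless
// Connect data (a Map), so the sink sees a structured document.
public class ParseJson<R extends ConnectRecord<R>> implements Transformation<R> {

  private final JsonConverter jsonConverter = new JsonConverter();

  @Override
  public void configure(Map<String, ?> configs) {
    // Schemaless mode: plain JSON bytes in, java.util.Map out.
    jsonConverter.configure(Collections.singletonMap("schemas.enable", "false"), false);
  }

  @Override
  public R apply(R record) {
    if (!(record.value() instanceof String)) {
      return record; // nothing to parse
    }
    byte[] json = ((String) record.value()).getBytes(StandardCharsets.UTF_8);
    SchemaAndValue parsed = jsonConverter.toConnectData(record.topic(), json);
    return record.newRecord(
        record.topic(), record.kafkaPartition(),
        record.keySchema(), record.key(),
        parsed.schema(), parsed.value(),
        record.timestamp());
  }

  @Override
  public ConfigDef config() {
    return new ConfigDef();
  }

  @Override
  public void close() {
    // no resources to release
  }
}
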
  • Can you add your ElasticSearch configuration? Looks like you're using StringConverter – OneCricketeer Nov 12 '18 at 23:41
  • @cricket_007 I added my ElasticSearch sink configuration. Let me know if there's a better way with a different `value.converter` or something. – Peter Lyons Nov 15 '18 at 18:57
  • Is there a reason you're not using the JSONConverter? since the payload is actually JSON after all... Can you also add your Mongo source connector? – OneCricketeer Nov 15 '18 at 19:21
  • I think I am using it for value.converter, right? Are you talking about key.converter? (I'm very new to kafka-connect and currently totally overwhelmed by all these configurations so I apologize if my questions don't even make sense assuming a clear/complete understanding) – Peter Lyons Nov 15 '18 at 19:24
  • Added my mongodb source connector configuration. – Peter Lyons Nov 15 '18 at 19:27
  • 1
    I have the same issue which is I guess determined by mongo source connector which hands over to its convertor (whatever would be) a `String` (which happens to contain the mongo JSON record) instead of a `Map` or Kafka `struct`. Till now I concluded that the solution with a custom JSON parser SMT seems to be the best. Please put a link for ParseJson sources if you agree to share it. – Adrian Apr 14 '19 at 18:17

1 Answer


I am not sure about your Mongo connector; I don't recognize the class or the configurations. Most people probably use the Debezium Mongo connector.

I would set it up this way, though:

"connector.class": "com.teambition.kafka.connect.mongo.source.MongoSourceConnector",

"key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
"value.serializer": "org.apache.kafka.common.serialization.JSONSerializer",
"key.serializer.schemas.enable": false,
"value.serializer.schemas.enable": true,

The schemas.enable setting is important; that way, the internal Connect data classes know how to convert to/from other formats.
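
For example, with schemas enabled the JsonConverter wraps every value in a schema/payload envelope, roughly like this (an illustration using a couple of fields from your document, not actual output from your topic):

{
  "schema": {
    "type": "struct",
    "optional": false,
    "fields": [
      { "field": "price", "type": "double", "optional": true },
      { "field": "enabled", "type": "boolean", "optional": true }
    ]
  },
  "payload": {
    "price": 15.99,
    "enabled": true
  }
}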

Then, in the sink, you again need to use the JSON deserializer (via the converter) so that it creates a full object rather than a plain-text string, as you currently see in Elasticsearch ({\"schema\":{\"type\":\"string\").

"connector.class":
  "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",

"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": false,
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": true

And if this doesn't work, then you might have to manually create your index mapping in Elasticsearch ahead of time so it knows how to actually parse the strings you are sending it.
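
For example, something along these lines, based on the fields in your source document (just a sketch; adjust the field types to whatever you actually need, and note that kafka-connect matches the type.name in your sink config):

PUT /redacted
{
  "mappings": {
    "kafka-connect": {
      "properties": {
        "id":      { "type": "keyword" },
        "price":   { "type": "double" },
        "enabled": { "type": "boolean" },
        "tags":    { "type": "keyword" },
        "style":   { "properties": { "color": { "type": "keyword" } } }
      }
    }
  }
}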

  • Thanks for your answer! I've tried your settings with 2 combinations of ExtractField. I end up with either still this error or a NullPointerException, which I think is because although now there's a schema for the wrapper JSON object, there's no schema for the `payload.object` JSON encoded string. Basically I think I would need 2 schemas in play. – Peter Lyons Nov 16 '18 at 18:42
  • Are you using the Confluent schema registry, by chance? I've not had much experience with the JSONConverter actually to know how well it persists schema information, but Avro is more strictly defined, and the Elastic connector prefers having a defined schema that way. Or, at least in the Mongo connector, maybe try analyze.schema=true so that you might persist the schema as part of the record in some way – OneCricketeer Nov 16 '18 at 22:26