
I have a Kafka Connect flow of MongoDB -> Kafka Connect -> Elasticsearch sending data end to end OK, but the payload document ends up as a JSON-encoded string instead of a structured document. Here's my source MongoDB document.

{
  "_id": "1541527535911",
  "enabled": true,
  "price": 15.99,
  "style": {
    "color": "blue"
  },
  "tags": [
    "shirt",
    "summer"
  ]
}

And here's my MongoDB source connector configuration:

{
  "name": "redacted",
  "config": {
    "connector.class": "com.teambition.kafka.connect.mongo.source.MongoSourceConnector",
    "databases": "redacted.redacted",
    "initial.import": "true",
    "topic.prefix": "redacted",
    "tasks.max": "8",
    "batch.size": "1",
    "key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer": "org.apache.kafka.common.serialization.JSONSerializer",
    "key.serializer.schemas.enable": false,
    "value.serializer.schemas.enable": false,
    "compression.type": "none",
    "mongo.uri": "mongodb://redacted:27017/redacted",
    "analyze.schema": false,
    "schema.name": "__unused__",
    "transforms": "RenameTopic",
    "transforms.RenameTopic.type":
      "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.RenameTopic.regex": "redacted.redacted_Redacted",
    "transforms.RenameTopic.replacement": "redacted"
  }
}

Over in Elasticsearch, it ends up looking like this:

{
  "_index" : "redacted",
  "_type" : "kafka-connect",
  "_id" : "{\"schema\":{\"type\":\"string\",\"optional\":true},\"payload\":\"1541527535911\"}",
  "_score" : 1.0,
  "_source" : {
    "ts" : 1541527536,
    "inc" : 2,
    "id" : "1541527535911",
    "database" : "redacted",
    "op" : "i",
    "object" : "{ \"_id\" : \"1541527535911\", \"price\" : 15.99,
      \"enabled\" : true, \"tags\" : [\"shirt\", \"summer\"],
      \"style\" : { \"color\" : \"blue\" } }"
  }
}

I'd like to use 2 single message transforms:

  1. ExtractField to grab object, which is a string of JSON
  2. Something to parse that JSON into an object, or just let the normal JsonConverter handle it, as long as it ends up properly structured in Elasticsearch.
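
To illustrate, after step 1 the record value would still be one big JSON-encoded string (essentially the contents of the object field in the Elasticsearch result above):

"{ \"_id\" : \"1541527535911\", \"price\" : 15.99, \"enabled\" : true, \"tags\" : [\"shirt\", \"summer\"], \"style\" : { \"color\" : \"blue\" } }"

and I'd like Elasticsearch to index the parsed document instead.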

I've attempted to do it with just ExtractField in my sink config, but I see this error logged by Kafka Connect:

kafka-connect_1       | org.apache.kafka.connect.errors.ConnectException:
Bulk request failed: [{"type":"mapper_parsing_exception",
"reason":"failed to parse", 
"caused_by":{"type":"not_x_content_exception",
"reason":"Compressor detection can only be called on some xcontent bytes or
compressed xcontent bytes"}}]

Here's my Elasticsearch sink connector configuration. In this version I have things working, but I had to write a custom ParseJson SMT. It works well, but if there's a better way, or a way to do this with some combination of built-in pieces (converters, SMTs, whatever works), I'd love to see it.

{
  "name": "redacted",
  "config": {
    "connector.class":
      "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "batch.size": 1,
    "connection.url": "http://redacted:9200",
    "key.converter.schemas.enable": true,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "schema.ignore": true,
    "tasks.max": "1",
    "topics": "redacted",
    "transforms": "ExtractFieldPayload,ExtractFieldObject,ParseJson,ReplaceId",
    "transforms.ExtractFieldPayload.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractFieldPayload.field": "payload",
    "transforms.ExtractFieldObject.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.ExtractFieldObject.field": "object",
    "transforms.ParseJson.type": "reaction.kafka.connect.transforms.ParseJson",
    "transforms.ReplaceId.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.ReplaceId.renames": "_id:id",
    "type.name": "kafka-connect",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": false
  }
}
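
For anyone curious about the ParseJson SMT: the general idea is roughly the following (a simplified sketch of the approach rather than the exact code; the package name is made up, and it just reuses Connect's own JsonConverter in schemaless mode to do the parsing):

package example.transforms; // hypothetical package name

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.json.JsonConverter;
import org.apache.kafka.connect.transforms.Transformation;

// Parses a record whose value is a JSON-encoded String into schemaless
// Connect data (a Map), so the sink sees a structured document.
public class ParseJson<R extends ConnectRecord<R>> implements Transformation<R> {

  private final JsonConverter jsonConverter = new JsonConverter();

  @Override
  public void configure(Map<String, ?> configs) {
    // Schemaless mode: plain JSON bytes in, java.util.Map out.
    jsonConverter.configure(Collections.singletonMap("schemas.enable", "false"), false);
  }

  @Override
  public R apply(R record) {
    if (!(record.value() instanceof String)) {
      return record; // nothing to parse
    }
    byte[] json = ((String) record.value()).getBytes(StandardCharsets.UTF_8);
    SchemaAndValue parsed = jsonConverter.toConnectData(record.topic(), json);
    return record.newRecord(
        record.topic(), record.kafkaPartition(),
        record.keySchema(), record.key(),
        parsed.schema(), parsed.value(),
        record.timestamp());
  }

  @Override
  public ConfigDef config() {
    return new ConfigDef();
  }

  @Override
  public void close() {
    // no resources to release
  }
}
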
  • Can you add your ElasticSearch configuration? Looks like you're using StringConverter – OneCricketeer Nov 12 '18 at 23:41
  • @cricket_007 I added my ElasticSearch sink configuration. Let me know if there's a better way with a different `value.converter` or something. – Peter Lyons Nov 15 '18 at 18:57
  • Is there a reason you're not using the JSONConverter? since the payload is actually JSON after all... Can you also add your Mongo source connector? – OneCricketeer Nov 15 '18 at 19:21
  • I think I am using it for value.converter, right? Are you talking about key.converter? (I'm very new to kafka-connect and currently totally overwhelmed by all these configurations so I apologize if my questions don't even make sense assuming a clear/complete understanding) – Peter Lyons Nov 15 '18 at 19:24
  • Added my mongodb source connector configuration. – Peter Lyons Nov 15 '18 at 19:27
  • 1
    I have the same issue which is I guess determined by mongo source connector which hands over to its convertor (whatever would be) a `String` (which happens to contain the mongo JSON record) instead of a `Map` or Kafka `struct`. Till now I concluded that the solution with a custom JSON parser SMT seems to be the best. Please put a link for ParseJson sources if you agree to share it. – Adrian Apr 14 '19 at 18:17

1 Answer


I am not sure about your Mongo connector; I don't recognize the class or the configurations. Most people probably use the Debezium Mongo connector.

I would set it up this way, though:

"connector.class": "com.teambition.kafka.connect.mongo.source.MongoSourceConnector",

"key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
"value.serializer": "org.apache.kafka.common.serialization.JSONSerializer",
"key.serializer.schemas.enable": false,
"value.serializer.schemas.enable": true,

The schemas.enable setting is important; that way, the internal Connect data classes know how to convert to/from other formats.
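
For example, with schemas enabled the JsonConverter wraps every value in a schema/payload envelope, roughly like this (an illustration using a couple of fields from your document, not actual output from your topic):

{
  "schema": {
    "type": "struct",
    "optional": false,
    "fields": [
      { "field": "price", "type": "double", "optional": true },
      { "field": "enabled", "type": "boolean", "optional": true }
    ]
  },
  "payload": {
    "price": 15.99,
    "enabled": true
  }
}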

Then, in the sink, you again need to use the JSON deserializer (via the converter) so that it creates a full object rather than a plain-text string, as you currently see in Elasticsearch ({\"schema\":{\"type\":\"string\").

"connector.class":
  "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",

"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": false,
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": true

And if this doesn't work, then you might have to manually create your index mapping in Elasticsearch ahead of time so it knows how to actually parse the strings you are sending it.
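
For example, something along these lines, based on the fields in your source document (just a sketch; adjust the field types to whatever you actually need, and note that kafka-connect matches the type.name in your sink config):

PUT /redacted
{
  "mappings": {
    "kafka-connect": {
      "properties": {
        "id":      { "type": "keyword" },
        "price":   { "type": "double" },
        "enabled": { "type": "boolean" },
        "tags":    { "type": "keyword" },
        "style":   { "properties": { "color": { "type": "keyword" } } }
      }
    }
  }
}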

  • Thanks for your answer! I've tried your settings with 2 combinations of ExtractField. I end up with either still this error or a NullPointerException, which I think is because although now there's a schema for the wrapper JSON object, there's no schema for the `payload.object` JSON encoded string. Basically I think I would need 2 schemas in play. – Peter Lyons Nov 16 '18 at 18:42
  • Are you using the Confluent schema registry, by chance? I've not had much experience with the JSONConverter actually to know how well it persists schema information, but Avro is more strictly defined, and the Elastic connector prefers having a defined schema that way. Or, at least in the Mongo connector, maybe try analyze.schema=true so that you might persist the schema as part of the record in some way – OneCricketeer Nov 16 '18 at 22:26