9

I am using a 3rd party CDC tool that replicates data from a source database into Kafka topics. An example row is shown below:

{  
   "data":{  
      "USER_ID":{  
         "string":"1"
      },
      "USER_CATEGORY":{  
         "string":"A"
      }
   },
   "beforeData":{  
      "Data":{  
         "USER_ID":{  
            "string":"1"
         },
         "USER_CATEGORY":{  
            "string":"B"
         }
      }
   },
   "headers":{  
      "operation":"UPDATE",
      "timestamp":"2018-05-03T13:53:43.000"
   }
}

What configuration is needed in the sink file to extract all the (sub)fields under data and headers, and ignore those under beforeData, so that the target table populated by the Kafka sink connector contains the following fields:

USER_ID, USER_CATEGORY, operation, timestamp

I went through the transformation list in Confluent's docs, but I was not able to find how to use them to achieve this.

Giorgos Myrianthous

2 Answers

9

I think you want ExtractField, and unfortunately it's a Map.get operation, which means 1) nested fields cannot be extracted in one pass and 2) multiple fields require multiple transforms.

That being said, you might attempt this (untested):

transforms=ExtractData,ExtractHeaders
transforms.ExtractData.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.ExtractData.field=data
transforms.ExtractHeaders.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.ExtractHeaders.field=headers

If that doesn't work, you might be better off implementing your own Transformation package that can at least drop values from the Struct / Map.
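To illustrate why chaining two ExtractField transforms fails, here is a rough Python simulation of the Map.get semantics on plain dicts (an illustration only; the real transform operates on Connect Structs/Maps, not Python dicts):

```python
# Toy stand-in for the record value, as plain Python dicts.
record = {
    "data": {"USER_ID": {"string": "1"}, "USER_CATEGORY": {"string": "A"}},
    "beforeData": {"Data": {"USER_ID": {"string": "1"},
                            "USER_CATEGORY": {"string": "B"}}},
    "headers": {"operation": "UPDATE", "timestamp": "2018-05-03T13:53:43.000"},
}

def extract_field(value, field):
    # ExtractField$Value replaces the ENTIRE value with one sub-field,
    # i.e. a single Map.get -- everything else is discarded.
    return value.get(field)

step1 = extract_field(record, "data")    # keeps only the "data" subtree
step2 = extract_field(step1, "headers")  # "headers" was discarded -> None
print(step1)
print(step2)  # None here; in Connect this surfaces as a NullPointerException
```

After the first extract, only the data subtree remains, so the second extract has nothing to get.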

OneCricketeer
  • You could take a look at Debezium's `io.debezium.transforms.UnwrapFromEnvelope` transform, and modify that to suit your purpose (and then share it with the community as other users of that 3rd party source CDC tool will have the same requirements) – Robin Moffatt May 11 '18 at 08:18
  • @cricket_007 Doesn't seem to work. I'm getting a `NullPointerException` when I try to extract two fields. – Giorgos Myrianthous May 16 '18 at 07:05
  • Yeah, figured as much. The first field probably extracts fine, but that's the only remaining object, so the second field can't be extracted. https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ExtractField.java – OneCricketeer May 16 '18 at 12:59
  • Maybe try doing two HoistField transforms to some temp Struct, then use ExtractField on that. Otherwise, like I said, you're going to have to write your own, and add it to the classpath – OneCricketeer May 16 '18 at 13:02
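For reference, the logic such a custom transform would need is small. A hypothetical sketch in Python, with plain dicts standing in for the Connect Struct API and assuming the `{"string": ...}` union wrappers shown in the question:

```python
def unwrap(value):
    # Drop beforeData by simply not copying it; merge data + headers flat.
    flat = {}
    for name, wrapped in value["data"].items():
        # Unwrap the {"string": "..."} union wrapper around each field.
        flat[name] = wrapped["string"]
    flat.update(value["headers"])  # adds operation and timestamp as-is
    return flat

record = {
    "data": {"USER_ID": {"string": "1"}, "USER_CATEGORY": {"string": "A"}},
    "beforeData": {"Data": {"USER_ID": {"string": "1"}}},
    "headers": {"operation": "UPDATE", "timestamp": "2018-05-03T13:53:43.000"},
}
print(unwrap(record))
# {'USER_ID': '1', 'USER_CATEGORY': 'A', 'operation': 'UPDATE',
#  'timestamp': '2018-05-03T13:53:43.000'}
```

A real SMT would implement the same logic in Java against `org.apache.kafka.connect.data.Struct` inside the `apply()` method.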
8

If you're willing to list specific field names, you can solve this by:

  1. Using a Flatten transform to collapse the nesting (nested paths become dot-delimited top-level field names)
  2. Using a ReplaceField transform with renames to rename the flattened fields to what the sink should emit
  3. Using another ReplaceField transform with whitelist to limit the emitted fields to those you select

For your case it might look like:

  "transforms": "t1,t2,t3",
  "transforms.t1.type": "org.apache.kafka.connect.transforms.Flatten$Value",
  "transforms.t2.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.t2.renames": "data.USER_ID:USER_ID,data.USER_CATEGORY:USER_CATEGORY,headers.operation:operation,headers.timestamp:timestamp",
  "transforms.t3.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.t3.whitelist": "USER_ID,USER_CATEGORY,operation,timestamp",
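As a sanity check of the field-name logic (not the Connect API itself), the three transforms can be simulated on plain dicts. This assumes the converter has already unwrapped the `{"string": ...}` wrappers so that `data.USER_ID` is the flattened name:

```python
def flatten(value, prefix="", delim="."):
    # Mimics Flatten$Value: nested paths become dot-delimited top-level names.
    out = {}
    for k, v in value.items():
        key = f"{prefix}{delim}{k}" if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, key, delim))
        else:
            out[key] = v
    return out

record = {
    "data": {"USER_ID": "1", "USER_CATEGORY": "A"},
    "beforeData": {"Data": {"USER_ID": "1", "USER_CATEGORY": "B"}},
    "headers": {"operation": "UPDATE", "timestamp": "2018-05-03T13:53:43.000"},
}

renames = {"data.USER_ID": "USER_ID", "data.USER_CATEGORY": "USER_CATEGORY",
           "headers.operation": "operation", "headers.timestamp": "timestamp"}
whitelist = {"USER_ID", "USER_CATEGORY", "operation", "timestamp"}

flat = flatten(record)                                         # t1: Flatten
renamed = {renames.get(k, k): v for k, v in flat.items()}      # t2: renames
result = {k: v for k, v in renamed.items() if k in whitelist}  # t3: whitelist
print(result)
# {'USER_ID': '1', 'USER_CATEGORY': 'A', 'operation': 'UPDATE',
#  'timestamp': '2018-05-03T13:53:43.000'}
```

The `beforeData.Data.*` fields survive the first two steps untouched and are only removed by the whitelist at the end.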
Marty Woodlee
  • Can we also flatten the ArrayList in JSON data? I'm facing the below issue in my elastic sink connector while flattening the ArrayList. **Caused by: org.apache.kafka.connect.errors.DataException: Flatten transformation does not support class java.util.ArrayList for record without schemas** – Vishal Dhanani Feb 21 '22 at 22:40
  • The above solution does not work when you have an ArrayList in your data. Do you know how to flatten the ArrayList and blacklist/whitelist the array? – Vishal Dhanani Apr 06 '22 at 15:05