
We are running the Kafka HDFS sink connector (version 5.2.1) and need the HDFS data to be partitioned by multiple nested fields. The data in the topics is stored as Avro and has nested elements. However, Connect cannot recognize the nested fields and throws an error saying the field cannot be found. Below is the connector configuration we are using. Does the HDFS sink connector not support partitioning by nested fields? I can partition by non-nested fields.

    {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics.dir": "/projects/test/kafka/logdata/coss",
        "avro.codec": "snappy",
        "flush.size": "200",
        "connect.hdfs.principal": "test@DOMAIN.COM",
        "rotate.interval.ms": "500000",
        "logs.dir": "/projects/test/kafka/tmp/wal/coss4",
        "hdfs.namenode.principal": "hdfs/_HOST@HADOOP.DOMAIN",
        "hadoop.conf.dir": "/etc/hdfs",
        "topics": "test1",
        "connect.hdfs.keytab": "/etc/hdfs-qa/test.keytab",
        "hdfs.url": "hdfs://nameservice1:8020",
        "hdfs.authentication.kerberos": "true",
        "name": "hdfs_connector_v1",
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://myschema:8081",
        "partition.field.name": "meta.ID,meta.source,meta.HH",
        "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner"
    }
rookie

1 Answer


I added nested field support for the TimestampPartitioner, but the FieldPartitioner still has an outstanding PR:

https://github.com/confluentinc/kafka-connect-storage-common/pull/67
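
Until that PR is merged, one workaround is to build the patched partitioner yourself, or to drop a small custom partitioner on the connector's plugin path. Below is a minimal sketch of such a class, assuming the 5.2.x partitioner API; the package and class names are hypothetical, and the dotted-path walking mirrors the approach in the PR rather than any shipped Confluent code.

    // Hypothetical sketch (not shipped Confluent code): a FieldPartitioner
    // variant that resolves dotted paths such as "meta.ID" by walking nested
    // Structs, mirroring the approach taken in the PR linked above.
    package com.example.partitioner;

    import io.confluent.connect.storage.partitioner.FieldPartitioner;
    import org.apache.kafka.connect.data.Struct;
    import org.apache.kafka.connect.errors.ConnectException;
    import org.apache.kafka.connect.sink.SinkRecord;

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class NestedFieldPartitioner<T> extends FieldPartitioner<T> {

      private List<String> fieldNames;

      @Override
      public void configure(Map<String, Object> config) {
        super.configure(config);  // let the parent set up the path delimiter etc.
        // Reuse the stock "partition.field.name" key. Depending on the
        // version it may arrive as a List or as a comma-separated String.
        Object raw = config.get("partition.field.name");
        fieldNames = raw instanceof List
            ? ((List<?>) raw).stream().map(Object::toString).collect(Collectors.toList())
            : Arrays.asList(raw.toString().split(","));
      }

      @Override
      public String encodePartition(SinkRecord sinkRecord) {
        if (!(sinkRecord.value() instanceof Struct)) {
          throw new ConnectException("Record value is not a Struct: " + sinkRecord.value());
        }
        Struct struct = (Struct) sinkRecord.value();
        // Produces e.g. "meta.ID=42/meta.source=app1/meta.HH=07"
        return fieldNames.stream()
            .map(name -> name + "=" + extract(struct, name))
            .collect(Collectors.joining("/"));
      }

      // Walk "a.b.c" one Struct level at a time; the last segment is the leaf.
      private static Object extract(Struct struct, String dottedName) {
        String[] parts = dottedName.split("\\.");
        Struct current = struct;
        for (int i = 0; i < parts.length - 1; i++) {
          current = current.getStruct(parts[i]);
        }
        return current.get(parts[parts.length - 1]);
      }
    }

If this class is packaged as a JAR and placed next to the connector's JARs, partitioner.class would then point at com.example.partitioner.NestedFieldPartitioner instead of the stock FieldPartitioner.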

OneCricketeer
  • I did see the TimeStampPartitioner and assumed it was done for the FieldPartitioner as well. This looks like a very common use case to me – rookie May 02 '19 at 20:25
  • Hey @cricket_007, nice PR. Hope it gets approved soon. Could this fix also be used for "partition.field.name" in s3-sink.properties, for Avro files, with multiple partition field names? How would it be used in a .properties file? Thank you – Julia Bel Feb 18 '20 at 18:13
  • @Julia Yes, it can be used in both HDFS and S3. It should work for any file type, and .properties and JSON files share the same key-value pairs. Note: Connect distributed mode (JSON configs) is preferred – OneCricketeer Feb 18 '20 at 18:19
  • Thank you, @cricket_007. Can I use something similar (in a .properties file), or should I use another format, e.g. partition.field.name='field1','field2'? I want my partitions laid out like field1 > field2. Thanks again! – Julia Bel Feb 18 '20 at 18:23
  • @JuliaBel That PR only does a single field, not multiple – OneCricketeer Feb 18 '20 at 18:27
  • Thank you, @cricket_007. I will continue looking for this answer in a topic more related to it. – Julia Bel Feb 19 '20 at 14:36
  • @Jul Actually, it might do multiple. Sorry, it's been 2 years since I wrote that code https://github.com/confluentinc/kafka-connect-storage-common/blob/master/partitioner/src/main/java/io/confluent/connect/storage/partitioner/FieldPartitioner.java#L40 – OneCricketeer Feb 19 '20 at 14:46
  • Don't worry :) I saw this implementation; that is why I asked. I couldn't find an example of using it in a .properties file so far, only for .json – Julia Bel Feb 19 '20 at 15:11
  • @Jul As mentioned, the key-value pairs are the same: replace the colon with an equals sign and drop the quotes on the key (see the example after these comments). Everything else is the same – OneCricketeer Feb 19 '20 at 15:21
  • So what is the status: can we partition by multiple fields using Kafka Connect or not? It is still not clear to me. – thebluephantom Apr 18 '20 at 10:30
  • @thebluephantom You can. Pull the PR and build it – OneCricketeer Apr 19 '20 at 11:19
  • OK, nice to know. Cheers there in TX. – thebluephantom Apr 19 '20 at 11:20
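
To make the .properties question above concrete: here are the relevant keys from the question's config, rewritten in standalone .properties form. Nothing new here, just the same key-value pairs with colons replaced by equals signs and the quotes dropped:

    name=hdfs_connector_v1
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    topics=test1
    partitioner.class=io.confluent.connect.storage.partitioner.FieldPartitioner
    partition.field.name=meta.ID,meta.source,meta.HH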