
We have been trying to build a Kafka consumer that reads roughly 2.7 TB/hour of data, spread across 60 partitions, from another Kafka cluster.

So far we have managed to consume roughly 2 TB of data per hour, which is not enough to keep up with the 2.7 TB/hour goal.

The cluster we are consuming from deletes data aggressively because of storage constraints, so we need to consume the data within 3 minutes, before it is deleted.

Details: we are consuming on 6 machines, and the topic has 60 partitions.

import java.io.StringReader;
import java.sql.Timestamp;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.Properties;
import java.util.UUID;
import javax.json.Json;
import javax.json.JsonObject;

import com.google.protobuf.util.JsonFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NotificationConsumerThread implements Runnable {

    private final KafkaConsumer<byte[], byte[]> consumer;
    private final String topic;

    public NotificationConsumerThread(String brokers, String groupId, String topic) {
        Properties prop = createConsumerConfig(brokers, groupId);
        this.consumer = new KafkaConsumer<>(prop);
        this.topic = topic;
        this.consumer.subscribe(Arrays.asList(this.topic));
    }

    private static Properties createConsumerConfig(String brokers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("group.id", groupId);
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("session.timeout.ms", "120000");
        props.put("request.timeout.ms", "120001");
        props.put("max.poll.records", "280000");
        props.put("fetch.min.bytes", "1");
        props.put("max.partition.fetch.bytes", "10000000");
        props.put("auto.offset.reset", "latest");
        props.put("receive.buffer.bytes", "15000000");
        props.put("send.buffer.bytes", "1500000");
        props.put("heartbeat.interval.ms", "40000");
      //  props.put("max.poll.interval.ms", "420000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        return props;
    }



    @Override
    public void run() {
        try {
            // Set up the HDFS client configuration (Kerberos-secured cluster).
            Configuration confHadoop = new Configuration();
            confHadoop.addResource(new Path("redacted"));
            confHadoop.addResource(new Path("redacted"));
            confHadoop.setBoolean("dfs.support.append", true);

            confHadoop.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
            confHadoop.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
            confHadoop.set("hadoop.security.authentication", "kerberos");
            confHadoop.set("dfs.namenode.kerberos.principal.pattern", "redacted");
            UserGroupInformation.setConfiguration(confHadoop);
            UserGroupInformation.loginUserFromKeytab("redacted", "redacted");

            FileSystem fileHadoop1 = FileSystem.get(confHadoop);
            // Buffer that accumulates the converted JSON records until they are flushed to HDFS.
            StringBuffer jsonFormat3 = new StringBuffer();

            while (true) {
                String jsonFormat;
                String jsonFormat1;
                String jsonFormat2;

                DateFormat dateFormat = new SimpleDateFormat("yyyyMMddHH");
                Date date = new Date();

                // Fetch the next batch of records (up to max.poll.records).
                ConsumerRecords<byte[], byte[]> records = consumer.poll(3000);


                for (ConsumerRecord<byte[], byte[]> record : records) {

                    // Deserialize the protobuf payload and convert it to JSON.
                    FlowOuterClass.Flow data = FlowOuterClass.Flow.parseFrom(record.value());
                    jsonFormat = JsonFormat.printer().print(data);
                    jsonFormat1 = jsonFormat.replaceAll("\\n", "");

                    // Extract the record timestamp from the JSON payload.
                    JsonObject jsonObject1 = Json.createReader(new StringReader(jsonFormat1)).readObject();
                    Timestamp ts = new Timestamp(Long.parseLong(jsonObject1.getString("xxxx")));

                    // Append an hourly bucket field and buffer the record as one JSON line.
                    date = new Date(ts.getTime());
                    jsonFormat2 = jsonFormat1.substring(0, jsonFormat1.length() - 1) + ", " + "\"xxxxx\"" + ": " + "\"" + dateFormat.format(date) + "\"" + "}\n";
                    jsonFormat3.append(jsonFormat2);
                }

                // Flush the buffer to a new HDFS file once it exceeds ~100 MB.
                String jsonFormat4 = jsonFormat3.toString();
                if (jsonFormat4.length() > 100000000) {
                    FSDataOutputStream stream = fileHadoop1.create(new Path("redacted-xxxxx" + dateFormat.format(date) + "/" + UUID.randomUUID().toString() + ".json"));
                    stream.write(jsonFormat4.getBytes());
                    stream.close();

                    jsonFormat3.delete(0, jsonFormat3.length());
                }
            }



        } catch (Exception e) {
            System.out.println(e);
        }
        consumer.close();
    }
}

Here's the lag status: (consumer-group lag screenshot omitted)

We could not find a solution online, so we would be glad to learn the best practice for consuming this volume of data with a Kafka consumer.

thanks!

Efe

1 Answer


There are a couple of things you can try to see whether you are able to catch up with the rate at which messages are being produced, with minimal latency.

  1. Increase the number of consumers in the consumer group. From the lag screenshot in the question, there appear to be 10 consumers running on the 6 machines. If the machines can handle more, you should consider running more consumers. Ideally increase the count to 12, 15, 20, or 30, because you want every consumer to be assigned the same number of partitions; in other words, the number of consumers should be a factor of 60 (the number of partitions in the topic you are consuming from). A launcher sketch is shown after this list.
  2. You have tried to increase the number of records returned per poll by setting max.poll.records to 280000. That setting only helps when two related configs are raised along with it: max.partition.fetch.bytes and fetch.max.bytes need to grow in proportion. You have already raised max.partition.fetch.bytes to 10000000 (about 10 MB); you should also raise fetch.max.bytes. In short, all of these values need to be adjusted in the right proportion (see the config sketch after this list). You may also find this useful: increase number of message in the poll.
  3. This is the last step to consider if the two approaches above are not enough. Since the number of partitions in a Kafka topic determines the maximum degree of parallelism you can achieve, you may consider increasing the partition count of the topic you are consuming from (for example, from the current 60 partitions to 120 or more).
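
To make point 1 concrete, here is a minimal launcher sketch, assuming you keep the NotificationConsumerThread class from the question. The broker list, group id, topic name, and thread count below are placeholders, not values from your setup; the only idea being illustrated is that the total number of consumers across all 6 machines ends up as a factor of 60 (here 5 threads x 6 machines = 30 consumers, i.e. 2 partitions each).

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConsumerLauncher {
    public static void main(String[] args) {
        // Placeholders: substitute your actual broker list, group id, and topic.
        String brokers = "broker1:9092,broker2:9092";
        String groupId = "notification-consumer-group";
        String topic = "notifications";

        // 5 threads per machine x 6 machines = 30 consumers in the group,
        // which divides the 60 partitions evenly (2 partitions per consumer).
        int threadsPerMachine = 5;
        ExecutorService pool = Executors.newFixedThreadPool(threadsPerMachine);
        for (int i = 0; i < threadsPerMachine; i++) {
            pool.submit(new NotificationConsumerThread(brokers, groupId, topic));
        }
    }
}

For point 2, here is a sketch of how the fetch-related settings could be scaled together inside the question's createConsumerConfig method. The exact byte values are illustrative assumptions, not tuned recommendations; the point is only that fetch.max.bytes and fetch.min.bytes are raised along with max.partition.fetch.bytes so that a single poll can actually return something close to max.poll.records.

        // Illustrative values only; adjust them in proportion to each other and to your record size.
        props.put("max.poll.records", "280000");
        props.put("max.partition.fetch.bytes", "10000000"); // ~10 MB per partition per fetch
        props.put("fetch.max.bytes", "100000000");          // ~100 MB per fetch request, so it does not cap the line above
        props.put("fetch.min.bytes", "1000000");            // let the broker batch ~1 MB before responding instead of 1 byte
        props.put("fetch.max.wait.ms", "500");              // how long the broker may wait for fetch.min.bytes

As a rough sanity check, with 30 consumers each one only needs to sustain about 2.7 TB / 30 ≈ 90 GB per hour (roughly 25 MB/s), which is a much more realistic per-consumer target than the current setup.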

I hope this helps.

Ajay Kr Choudhary