
I write messages to Kafka from CSV files. My producer logs that all of the data was produced to the Kafka topic.

Along with that, I use Apache NiFi as the consumer of the Kafka topic (the ConsumeKafka_2_0 processor).

If I produce data to Kafka in a single stream, everything is OK, but if I run multiple producers over multiple files in parallel, I lose a lot of rows.

The core of my producer is the following:

import csv
import json
import sys

import avro.schema
from confluent_kafka import Producer, KafkaException

# `config` (settings) and `log` (logger) are the project's own modules.

def produce(for_produce_file):
    log.info('Producing started')

    # The Avro schema supplies the field names, which double as the CSV column names.
    avro_schema = avro.schema.Parse(open(config.AVRO_SCHEME_FILE).read())
    names_from_schema = [field.name for field in avro_schema.fields]

    producer = Producer({'bootstrap.servers': config.KAFKA_BROKERS,
                         'security.protocol': config.SECURITY_PROTOCOL,
                         'ssl.ca.location': config.SSL_CAFILE,
                         'sasl.mechanism': config.SASL_MECHANISM,
                         'sasl.username': config.SASL_PLAIN_USERNAME,
                         'sasl.password': config.SASL_PLAIN_PASSWORD,
                         'queue.buffering.max.messages': 1000000,
                         'queue.buffering.max.ms': 5000})
    try:
        file = open(for_produce_file, 'r', encoding='utf-8')
    except FileNotFoundError:
        log.error(f'File {for_produce_file} not found')
    else:
        produced_str_count = 0
        csv_reader = csv.DictReader(file, delimiter="|", fieldnames=names_from_schema)
        log.info(f'File {for_produce_file} opened')

        for row in csv_reader:
            # DictReader already keys each row by the schema field names.
            record = dict(zip(names_from_schema, row.values()))
            while True:
                try:
                    producer.produce(config.TOPIC_NAME, json.dumps(record).encode('utf8'))
                    producer.poll(0)  # serve delivery events without blocking
                    produced_str_count += 1
                    break
                except BufferError as e:
                    # Local queue is full: wait for deliveries to free space, then retry the row.
                    log.info(e)
                    producer.poll(5)
                except KafkaException as e:
                    log.error('Kafka error: {}, message was {} bytes'.format(e, sys.getsizeof(json.dumps(record))))
                    log.error(row)
                    break

        file.close()
        producer.flush()  # block until every queued message is delivered or fails
        log.info(f'Producing ended. Produced {produced_str_count} rows')
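
For reference, producer.produce() above only hands the message to the client's local queue; whether every row was actually acknowledged by the broker could be checked with a per-message delivery callback. A minimal sketch of how this could be wired into the loop above (the stats dict and on_delivery function are illustrative; the on_delivery parameter itself is part of confluent_kafka's produce()):

stats = {'delivered': 0, 'failed': 0}

def on_delivery(err, msg):
    # Invoked from poll()/flush() once the broker acks or rejects a message.
    if err is not None:
        stats['failed'] += 1
        log.error(f'Delivery failed for {msg.topic()}: {err}')
    else:
        stats['delivered'] += 1

# In the produce loop, register the callback on every message:
#     producer.produce(config.TOPIC_NAME,
#                      json.dumps(record).encode('utf8'),
#                      on_delivery=on_delivery)
#     producer.poll(0)  # serves the callbacks
#
# After flush(), stats['delivered'] + stats['failed'] should equal the number of rows read.

If the failed counter stays at zero across all parallel producers while rows are still missing downstream, the loss is more likely on the NiFi/consumer side.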

[Screenshots of the NiFi ConsumeKafka_2_0 processor properties]

The Kafka cluster consists of 3 nodes, and the topic replication factor is 3.

Could the problem be that the producers write faster than the consumer reads, and data is deleted when some buffer or segment overflows?
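
One way to check this would be to count what is actually stored in the topic with a standalone consumer, independent of NiFi, and compare it with the totals the producers log. A rough sketch (assuming confluent_kafka; the group id, timeouts, and the stop-after-a-few-empty-polls idea are only for this diagnostic):

from confluent_kafka import Consumer

def count_topic_records(topic):
    # Diagnostic only: read the topic from the beginning and count records.
    # The SASL/SSL settings from the producer config would be needed here too.
    consumer = Consumer({'bootstrap.servers': config.KAFKA_BROKERS,
                         'group.id': 'row-count-check',   # throwaway group id
                         'auto.offset.reset': 'earliest',
                         'enable.auto.commit': False})
    consumer.subscribe([topic])
    count = 0
    empty_polls = 0
    while empty_polls < 10:          # stop after ~10 s without new messages
        msg = consumer.poll(1.0)
        if msg is None:
            empty_polls += 1
            continue
        if msg.error():
            log.error(msg.error())
            continue
        empty_polls = 0
        count += 1
    consumer.close()
    return count

If this count matches the sum of the "Produced N rows" totals from all producers, the rows are reaching Kafka and the problem is on the NiFi side; if not, they are lost on the produce path.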

Please give me advice.

  • Please share the topic configs. Also, are you leveraging acks in the producer config? I don’t see that in your code snippet. – Suman Aug 21 '20 at 18:34
  • Also share the NiFi spec, and talk about the considerations you have made in the flow for concurrency. How many flowfiles are moving through the system when you are experiencing data loss, etc.? – steven-matison Aug 21 '20 at 18:39
  • Topic: transactions PartitionCount: 1 ReplicationFactor: 3 Configs: Topic: transactions Partition: 0 Leader: 2 Replicas: 2,3,1 Isr: 3,2,1 – Grigory Skvortsov Aug 21 '20 at 18:44
  • I just added acks to the producer, will test it (sketch of the change after these comments). – Grigory Skvortsov Aug 21 '20 at 18:44
  • In my case, parallelism means running a producer over a set of files. In NiFi I read from a Kafka topic with only one partition; after that, I transform the messages into one ORC file and store it in Hive. – Grigory Skvortsov Aug 21 '20 at 18:48
  • Sometimes I see messages like this in the Kafka logs: "Received LeaderAndIsrRequest with correlation id 1 from controller 2 epoch 21 for partition __transaction_state-43 (last update controller epoch 21) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)". Could there be a problem here? – Grigory Skvortsov Aug 21 '20 at 18:53
  • Don't know if you fixed your problem. You do realize that you are reading up to 100000 records into 1 flowfile. – Christoph Bauer Aug 25 '20 at 09:02
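
Following the acks suggestion in the comments, this is roughly the producer change I am testing. A minimal sketch (assuming confluent_kafka; 'acks', 'enable.idempotence' and 'message.timeout.ms' are standard librdkafka settings, and the values shown are just the ones I am trying):

producer = Producer({'bootstrap.servers': config.KAFKA_BROKERS,
                     'security.protocol': config.SECURITY_PROTOCOL,
                     'ssl.ca.location': config.SSL_CAFILE,
                     'sasl.mechanism': config.SASL_MECHANISM,
                     'sasl.username': config.SASL_PLAIN_USERNAME,
                     'sasl.password': config.SASL_PLAIN_PASSWORD,
                     'queue.buffering.max.messages': 1000000,
                     'queue.buffering.max.ms': 5000,
                     # wait for all in-sync replicas to acknowledge each message
                     'acks': 'all',
                     # retry internally without reordering or duplicating on broker hiccups
                     'enable.idempotence': True,
                     # give up on a message after this long and report it via the delivery callback
                     'message.timeout.ms': 300000})

With 'acks': 'all', the delivery callback sketched earlier would also report any row that never made it to all in-sync replicas.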

0 Answers