
I'm calling Kafka through the Confluent REST Proxy. I'm reading a CSV file, creating objects out of the records in it (about 4 million records) and sending requests to the REST proxy. I keep getting an OutOfMemory exception.

The exact exception message is:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "kafka-producer-network-thread | producer-81"

I have just one instance of the REST proxy server, hosted as a Docker container. The environment variable is set to:

JAVA_OPTIONS=-Xmx1g

Other configs:

CPU: 1, Memory: 1024 MB

It processes about 100,000 records before crashing. I've also tried scaling it to 4 instances, increasing CPU to 3 and memory to 2046 MB. It then processes about 500,000 records.

After reading the CSV, I'm calling the Kafka endpoint in batches of 5,000 records. That part is written in Node. Here's the Node code:

const fs = require('fs');
// 'parser' is the streaming CSV parser factory used here (the library itself is not shown)
let country = [];
let batch = 0;

fs.createReadStream(inputFile)
  .pipe(parser({skip_lines_with_error: true}))
  .on('data', (records) => {
        country.push({ 'value' : {
            country: records[0],
            capital: records[1]
            }
        });

        if (country.length > 5000) {
            batch++;
            callKafkaProxy(country).then((rec) => {
                console.log('Batch done!');
            }).catch((reason) => {
                console.log(reason);
            });
            country = [];
        }
    })
    .on('end', () => {
        console.log('All done!');
    });
function callKafkaProxy(records) {
    const urlAndRequestOptions = {
        url: 'http://kafka-rest-proxy.com/topics/test-topic',
        headers: {
            'content-type' : 'application/vnd.kafka.json.v2+json',
            'Accept' : 'application/vnd.kafka.v2+json'
        }
    };
    let recordsObject = {records: records};
    //request here is a wrapper on the http package.
    return request.post(urlAndRequestOptions, recordsObject);
}
I feel like I'm missing some configuration that would help solve this without increasing the number of instances beyond 1.

Any help will be appreciated.

  • Can you give a snippet of your code showing how you batch records to Kafka? It seems that you either do not use Node streams or are handling the asynchronous part the wrong way. – Žilvinas Jocius Apr 03 '19 at 08:49
  • Just added @ŽilvinasJocius, you can review it. – Gaurav Goenka Apr 03 '19 at 08:58
  • Does the Kafka REST client really suit your needs? What about a client that communicates directly over TCP/IP rather than through an extra layer like HTTP? If you found such a client that supports streams, your task would run much faster without these issues. – Žilvinas Jocius Apr 03 '19 at 09:22
  • The REST client is already in use in a lot of places in production right now. However, it currently produces messages at volumes a few orders of magnitude lower than this. Producing to Kafka directly is possible, but not convenient, and it prevents code reuse (since the REST client is already used by other apps). – Gaurav Goenka Apr 03 '19 at 09:31
  • Did you understand the problem I tried to explain in my answer below? – Žilvinas Jocius Apr 03 '19 at 09:32
  • code reuse should be possible via RequireJS, no? – OneCricketeer Apr 03 '19 at 20:51
  • @cricket_007, it's a REST client. We don't need RequireJS or anything; an HTTP POST request does the job. – Gaurav Goenka Apr 04 '19 at 05:56

2 Answers

.on('data', () => {}); ... 

1. It does not handle backpressure. Create a writable stream which will handle your batching process, then just use pipe:

inputStream
    .pipe(parser)
    .pipe(kafka)

Then analysing these lines:

if (country.length > 5000) {
    batch++;
    callKafkaProxy(country).then((rec) => {
        console.log('Batch done!');
    }).catch((reason) => {
        console.log(reason);
    });
    country = [];
}
  1. Your callKafkaProxy is asynchronous, which is why your country array keeps filling no matter what callKafkaProxy returns. The array keeps filling and keeps triggering new requests. You can confirm this by console-logging after batch++: you will see that you are initiating lots of requests, and Kafka responds much more slowly than you are issuing them.

Solution:

  1. Create a writable stream.
  2. Pipe data to it from your parser: input.pipe(parser).pipe(yourJustCreatedKafkaWritableStream)
  3. Let your writable stream push countries into an array and invoke its callback only when it is ready to receive the next record. When you reach your limit (if countries.length > 5000), make the request to Kafka, wait for the response, and only then give the callback; a sketch of such a writable stream follows this list. This way your stream adapts to how fast Kafka can accept data. You should read more about Node streams and their power. But remember, with great power comes great responsibility: you have to design the code carefully to avoid memory problems like this one.
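
A minimal sketch of such a writable stream, reusing the callKafkaProxy helper, the parser and the 5000-record batch size from the question (the KafkaBatchWriter class itself is only an illustration, not tested production code):

const { Writable } = require('stream');

const BATCH_SIZE = 5000; // same batch size as in the question

class KafkaBatchWriter extends Writable {
    constructor() {
        super({ objectMode: true }); // each chunk is one parsed CSV record
        this.batch = [];
    }

    _write(record, encoding, callback) {
        this.batch.push({ value: { country: record[0], capital: record[1] } });

        if (this.batch.length < BATCH_SIZE) {
            return callback(); // ready for the next record right away
        }

        const records = this.batch;
        this.batch = [];
        // Only signal readiness once the REST proxy has responded, so the
        // parser upstream is paused while Kafka catches up (backpressure).
        callKafkaProxy(records)
            .then(() => callback())
            .catch(callback);
    }

    _final(callback) {
        // Flush whatever is left when the input ends.
        if (this.batch.length === 0) return callback();
        callKafkaProxy(this.batch)
            .then(() => callback())
            .catch(callback);
    }
}

fs.createReadStream(inputFile)
    .pipe(parser({skip_lines_with_error: true}))
    .pipe(new KafkaBatchWriter())
    .on('finish', () => console.log('All done!'));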
  • I understand #2. It might need to be a synchronous call, or have some delay between calls. I am still not following #1. What do you mean by backpressure, and what is `.pipe(kafka)`? – Gaurav Goenka Apr 03 '19 at 09:41
  • Do not use a random delay. The problem is that the .on('data') event listener does not wait for you to do anything; it keeps emitting. That causes your code to keep pushing data to Kafka. Your check if (country.length > ...) does not throttle anything, because the 'data' event keeps adding records to your array. If you debug it, you will see that you are pushing records to Kafka regardless of the responses. You should do it in a stream fashion: read file -> parse -> push to Kafka. Kafka is the bottleneck here, which is why you have to create a writable stream that tells the earlier stages to work more slowly. – Žilvinas Jocius Apr 03 '19 at 09:46
  • Thanks. I understand what you're saying. I need to see how I can use streams. The csv-parser is available to be used as streams. I'm going to try it out and see if that solves my problem. – Gaurav Goenka Apr 03 '19 at 11:01
  • thanks for pointing me in the right direction. I've gotten it to work using just `fs.createReadStream`. I'll post an answer with the details, so anyone who turns to this question can find the implementation useful. – Gaurav Goenka Apr 05 '19 at 05:54

With help from Zilvinas's answer, I understood how I could harness streams to send data in batches. Here's a solution:

const fs = require('fs');
const es = require('event-stream'); // provides split() and mapSync()
// postChunkToKafka wraps the HTTP POST to the REST proxy (like callKafkaProxy above);
// logger is the application logger.
let records = [];

var stream = fs.createReadStream(file)
    .pipe(es.split())
    .pipe(es.mapSync(function (line) {

        if (line.length) {
            //read your line, create a record message and push it onto records
        }

        //put 5000 in a config constant
        if (records.length === 5000) {
            stream.pause();
            logger.debug(`Got ${records.length} messages. Pushing to Kafka...`);
            postChunkToKafka(records).then((response) => {
                records = [];
                stream.resume();
            });
        }
    }));
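
Note that, as written, any records still sitting in the array when the file ends are never sent, because the check only fires at exactly 5000. One way to flush the final partial batch, assuming the same stream, records array, logger and postChunkToKafka helper as above, would be an 'end' handler along these lines:

stream.on('end', () => {
    if (records.length > 0) {
        logger.debug(`Pushing final batch of ${records.length} messages to Kafka...`);
        postChunkToKafka(records).then(() => {
            records = [];
        });
    }
});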