MarkLogic - Design suggestion for efficient Batch processing

Question

MarkLogic version 9.0-6.1

We have implemented two patterns for batch ingestion.

Pattern 1: MLCP

Pattern 2: Informatica (or NiFi) reading an NDJSON file and making a MarkLogic REST API PUT call for each JSON document in the file

Our production box is a 3-node cluster with 72 cores.

Our MLCP jobs run well with the default thread count of 4, and we run at most 3 MLCP jobs in parallel, which ensures that at least 60 cores are available for Real Time (or Near Real Time) processing at any point in time.

However, I am not sure how the Informatica/NiFi batch jobs use up the cores. Like MLCP, is there a way to limit the cores used by Informatica/NiFi jobs to ensure that sufficient cores/threads are available for Real Time processing?

As we add more and more processes to production, we see a big increase in time-out errors for Real-Time REST API PUT/GET calls. These calls typically take only a few milliseconds (when we run them individually), so I am guessing that contention for resources is causing the time-outs.

We have the option to scale out by adding nodes to the cluster, but this situation made me think that MLCP is a better design than REST PUT calls for batch ingestion, because it gives us better control over the cores/threads used by each batch process and so keeps sufficient cores available for Real-Time processing. Is there a way to control or limit the resources used by NiFi when it is used for batch ingestion?

Please suggest. Thanks in advance!

This sounds like an Informatica/NiFi question, since the problem seems to be about how to control the number of concurrent requests it makes to ML. — wst, Oct 24 '19 at 16:09

score 0 · Accepted Answer · answered Oct 25 '19 at 15:55

It looks like this is an issue specific to Informatica, as Informatica does not have a native connector for MarkLogic. Hence we have to make a REST PUT call for each JSON document in the file. Also, threading is controlled by Informatica, so the developer has little control over the maximum thread count.

However, NiFi has a native MarkLogic connector, which uses the Java Data Movement SDK (DMSDK) instead of per-document REST API calls to ingest data. As we can see, this is more efficient, both in terms of performance and scalability.

The solution for Informatica customers seems to be making a native connector available for MarkLogic (just like MongoDB, SalesForce etc.).
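For a custom ingestion client (as opposed to Informatica, where the tool owns the threading), the same kind of cap that MLCP's thread count provides can be approximated on the client side. The following is only a minimal sketch in Python, not part of any MarkLogic or Informatica API: `ingest_ndjson` and its `send` argument are hypothetical names, and `send` stands in for whatever function issues the actual REST PUT.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def ingest_ndjson(lines, send, max_workers=4):
    """Send each NDJSON record with at most `max_workers` calls in flight.

    `send` is a hypothetical callable supplied by the caller (for example,
    a wrapper around an HTTP PUT to a REST endpoint); it is an assumption
    made for illustration, not a real MarkLogic API.
    """
    # Parse one JSON document per non-empty line.
    docs = [json.loads(line) for line in lines if line.strip()]
    # A fixed-size pool caps how many requests run at the same time,
    # which bounds the load this batch job can place on the server.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input documents.
        return list(pool.map(send, docs))
```

With `max_workers=4`, such a job would never have more than four requests outstanding, which is roughly comparable to one MLCP job running with its default thread count. This sketch reads all records into memory first, so a very large file would need to be processed in chunks.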
