
I'm developing a Hazelcast Jet batch job using the core API (not the pipeline API). One of the intermediate vertices of my DAG must do a map operation, but would benefit from mapping a batch of items at a time, rather than one by one. My mapping operation - in fact - must enrich the input items by executing a database query, and rather than executing one query for each input item I would prefer to accumulate inputs and execute a batch query.

My solution so far has been to implement a custom BatchMapP processor which accumulates items and periodically (either when a batch size is reached or when a given grouping predicate says so), performs a map function on the group of items, and emits the transformed items using a traverser.

From the API perspective, my processor is created in this way:

public static <I, O> SupplierEx<Processor> batchMapP(int batchSize, FunctionEx<List<I>, List<O>> mapFunction);
public static <I, O> SupplierEx<Processor> batchMapP(BiPredicateEx<List<I>, I> groupPredicate, FunctionEx<List<I>, List<O>> mapFunction)
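For illustration, the batching core behind such a processor could look like the following simplified, synchronous sketch (names and shape are my own; the real processor would extend AbstractProcessor and emit results through a traverser rather than return a list):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simplified, synchronous sketch of the batching logic behind a
// hypothetical batchMapP: accumulate inputs, flush each full batch
// through the map function, and collect the mapped outputs.
public class BatchMapSketch {
    public static <I, O> List<O> mapInBatches(
            List<I> inputs, int batchSize, Function<List<I>, List<O>> mapFn) {
        List<O> out = new ArrayList<>();
        List<I> batch = new ArrayList<>(batchSize);
        for (I item : inputs) {
            batch.add(item);
            if (batch.size() == batchSize) {   // flush a full batch
                out.addAll(mapFn.apply(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {                // flush the remainder
            out.addAll(mapFn.apply(batch));
        }
        return out;
    }
}
```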

Isn't there a built-in processor that already does that for me?

Mirko Luchi

1 Answer


Unfortunately, as of Jet 4.1 there's no such processor in the public core API. The closest is AsyncTransformUsingServiceBatchedP: it is batched, but it is also asynchronous. However, it's easy to make it synchronous by returning an already-completed future:

ServiceFactory<?, Connection> sf = ServiceFactories.nonSharedService(pCtx -> createConnection())
        .withDestroyServiceFn(Connection::close)
        .toNonCooperative(); // needed because we'll block

Vertex map = dag.newVertex("map", 
        AsyncTransformUsingServiceBatchedP.supplier(sf, 1, 1024,
                (conn, inputBatch) -> {
                    /* execute the query for a batch of statements */
                    List<Object> results = executeQueryForBatch(conn, inputBatch);
                    return CompletableFuture.completedFuture(traverseIterable(results));
                }));
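The executeQueryForBatch helper is not part of Jet; it is whatever turns one batch of inputs into one database round trip. A hypothetical JDBC sketch, assuming a single lookup key per item and made-up table/column names:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical implementation of executeQueryForBatch: enrich a whole
// batch of keys with a single SELECT ... IN (...) query instead of
// issuing one query per item. Table and column names are assumptions.
public class BatchQuery {

    // Builds the "?,?,?" placeholder list for an IN clause of size n.
    static String placeholders(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            if (i > 0) sb.append(',');
            sb.append('?');
        }
        return sb.toString();
    }

    static List<String> executeQueryForBatch(Connection conn, List<Long> keys)
            throws SQLException {
        String sql = "SELECT name FROM customer WHERE id IN ("
                + placeholders(keys.size()) + ")";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < keys.size(); i++) {
                ps.setLong(i + 1, keys.get(i));  // bind each key
            }
            List<String> results = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    results.add(rs.getString(1));
                }
            }
            return results;
        }
    }
}
```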

The processor uses automatic batching: it creates batches from the items that are readily available, without waiting. This gives you minimal latency when there are few items, yet doesn't limit throughput when there are many. But when traffic is light, the batch size can drop to as little as a single item, which puts extra load on the remote system due to the larger number of calls.

If you want your BatchMapP processor to use automatic batching, override the process(int ordinal, Inbox inbox) method instead of tryProcess(int ordinal, Object item).
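To illustrate the draining loop that process(int, Inbox) enables, here is a simplified sketch with a plain java.util.Queue standing in for Jet's Inbox (the real method works on the Inbox interface and must also cooperate with outbox backpressure):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of automatic batching in process(int, Inbox): drain whatever
// items are readily available (up to a cap) into one batch, instead of
// handling items one at a time as tryProcess() would.
public class DrainSketch {
    public static <T> List<T> drainBatch(Queue<T> inbox, int maxBatchSize) {
        List<T> batch = new ArrayList<>();
        T item;
        while (batch.size() < maxBatchSize && (item = inbox.poll()) != null) {
            batch.add(item);   // take only what is already waiting
        }
        return batch;          // may be smaller than maxBatchSize
    }
}
```

The point is that the batch boundary falls wherever the inbox happens to be empty, so no item ever waits for a batch to fill up.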

Oliv
  • Feel free to create an issue to add a method for this processor to the public core API. – Oliv May 12 '20 at 20:27