
I'm evaluating Hazelcast Jet for my project needs, but I found the documentation rather vague on the following topics:

1) When I perform a data join on two list streams, for example:

BatchStage<Trade> trades = p.drawFrom(list("trades"));
BatchStage<Entry<Integer, Broker>> brokers = p.drawFrom(list("brokers"));
BatchStage<Tuple2<Trade, Broker>> joined = trades.hashJoin(brokers,
    joinMapEntries(Trade::brokerId),
    Tuple2::tuple2);
joined.drainTo(Sinks.logger());

then can I somehow tell Jet which kind of join will actually happen underneath, a map-side join or a reduce-side join? Imagine the "brokers" set is small and the trades set is really huge. The optimal technique for joining these two sets is a map-side join, a.k.a. broadcast join. What data will be transferred over the network when Jet performs the join? Are there any size-based optimizations?

2) I was testing the following scenario:

a simple pipeline:

private Pipeline createPipeLine() {
    Pipeline p = Pipeline.create();
    BatchStage<Date> stage = p.drawFrom(Sources.<Date>list("master"));
    stage.drainTo(Sinks.logger());
    return p;
}

list("master") is being constantly filled by another node in the cluster. Now when I submit this pipeline to cluster, only subset of the list("master") is drained to logger. Can I somehow set the Jet job to be constantly draining the list("master") to standard output?

Thanks in advance

Tomas Kloucek

1 Answer

  1. From the Javadoc of hashJoin:

    Implementationally, the hash-join transform is optimized for throughput so that each computing member has a local copy of all the enriching data, stored in hashtables (hence the name). The enriching streams are consumed in full before ingesting any data from the primary stream.

    In your example, all the items from the brokers list will be consumed first and copied to every member, and only then will the trades list be consumed (see the annotated join sketch at the end of this answer).

  2. IList is a batch source; you need a streaming source to consume the items continuously. You can use an IQueue as the source. Here is a simple way to create a source for a queue:

    StreamSource<Trade> queueSource = SourceBuilder
        .<IQueue<Trade>>stream("queueStream",
            c -> c.jetInstance().getHazelcastInstance().getQueue("trades"))
        // poll() returns null when the queue is empty, so only buffer real items
        .<Trade>fillBufferFn((queue, buf) -> { Trade t = queue.poll(); if (t != null) buf.add(t); })
        .build();
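
    A minimal usage sketch (my assumptions: the same drawFrom/drainTo Pipeline API used in the question, and a JetInstance named jet, e.g. from Jet.newJetInstance()). Because queueSource is a streaming source, the job never completes on its own and keeps draining the queue to the logger until you cancel it:

    Pipeline p = Pipeline.create();
    p.drawFrom(queueSource)        // streaming source: never reports completion
     .drainTo(Sinks.logger());     // logs each Trade polled from the queue

    jet.newJob(p);                 // the streaming job keeps running until cancelled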
    
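Regarding the network question in 1): going by the Javadoc quoted above, the enriching stream you pass to hashJoin (the brokers) is the data that gets copied to every member and stored in local hashtables, while the primary stream (the trades) is matched against those local tables and does not have to be shuffled by the join key. So hashJoin already behaves like the map-side/broadcast join you describe, and keeping the small data set on the enriching side is the right layout. Here is the same call from the question, annotated as a sketch with that reading:

BatchStage<Tuple2<Trade, Broker>> joined = trades.hashJoin(
    brokers,                          // enriching stream: replicated to every member and
                                      // built into a local Broker hashtable (the network cost)
    joinMapEntries(Trade::brokerId),  // each trade's brokerId is looked up in that local table
    Tuple2::tuple2);                  // emits (trade, broker) pairs
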
ali
  • 2. Regarding IList, I missed that it's a batch source only, I apologize. – Tomas Kloucek Nov 05 '18 at 11:43
  • I then had success with an IMap, configuring it as a StreamSource; it works as expected. 1. "each computing member has a local copy of all the enriching data" is very useful information for designing the Pipeline job according to the size of the processed data. Thank you. – Tomas Kloucek Nov 05 '18 at 11:53
  • [This blog post](https://blog.hazelcast.com/ways-to-enrich-stream-with-jet/) might also give you some insight on how to work with joins/enrichment. – Can Gencer Nov 05 '18 at 15:07