Hazelcast Jet Cluster. Workload not distributed

Question

I have one huge CSV file and a Jet cluster with 3 nodes. When the job is submitted, only one node processes the entire file. I want the work to be distributed across the nodes. In other words, how can I make optimal use of the resources on each node to get the work done faster?

    Pipeline p = Pipeline.create();

    BatchSource<List<String>> source = Sources.filesBuilder("files")
            .glob("*.csv")
            .build(path -> Files.lines(path).skip(1).map(line -> split(line)));

    p.readFrom(source)
            .map(function1)
            .map(function2)
            .writeTo(Sinks.filesBuilder("out").build());
    instance.newJob(p).join();

Mike Yawn · Answer 1 · 2020-07-30T20:15:52.133

1

The rebalance() operator was introduced in Jet 4.2, and I think it will do exactly what you need. By default, a non-partitioned data source is processed on a single node, but adding rebalance() will distribute the work.

See https://jet-start.sh/docs/api/more-transforms#rebalance

The rebalance() call would go between readFrom(source) and map(function1).
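
Applied to the pipeline from the question, the placement could look roughly like the sketch below. It assumes Jet 4.2 or later on the classpath. The class name RebalanceSketch, the method buildPipeline, the comma-based split and the two map lambdas are placeholders I made up to stand in for split(line), function1 and function2, which the question does not show.

```java
import com.hazelcast.jet.pipeline.BatchSource;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

public class RebalanceSketch {

    // Builds the pipeline only; submitting it works as in the question,
    // e.g. instance.newJob(buildPipeline()).join();
    static Pipeline buildPipeline() {
        Pipeline p = Pipeline.create();

        // Same source as in the question. The naive comma split is an
        // assumption standing in for the question's split(line) helper.
        BatchSource<List<String>> source = Sources.filesBuilder("files")
                .glob("*.csv")
                .build(path -> Files.lines(path)
                        .skip(1) // skip the CSV header row
                        .map(line -> Arrays.asList(line.split(","))));

        p.readFrom(source)
                // Ask Jet to spread the items over the cluster members
                // before the mapping stages run (available since Jet 4.2).
                .rebalance()
                .map(row -> row)                   // placeholder for function1
                .map(row -> String.join(",", row)) // placeholder for function2
                .writeTo(Sinks.filesBuilder("out").build());

        return p;
    }
}
```

As far as I understand it, the single file is still read in one place, so only the stages after rebalance() are spread over the members. Whether that pays off depends on how expensive function1 and function2 are compared with sending the rows over the network.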

edited Jul 30 '20 at 20:15

answered Jul 30 '20 at 16:11

Mike Yawn

