
I am new to Hazelcast Jet, so I was wondering if I am doing something wrong.

I am testing this locally. I start up 2 instances of Hazelcast Jet locally:

Jet.newJetInstance();

This is done by running a separate class with a `public static void main` method twice.

Then I submit the job from a new Jet client instance that has all the logic. I print the number of records processed, but I only see this printed on one node rather than spread roughly evenly across the nodes, as it is supposed to run on all of them. Am I doing something wrong, or am I missing a setting?

Here is the code for my streaming job:

package com.geek.hazelcast;

import com.google.common.collect.Lists;
import com.hazelcast.core.IMap;
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Job;
import com.hazelcast.jet.Traverser;
import com.hazelcast.jet.Traversers;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.core.AbstractProcessor;
import com.hazelcast.jet.core.ProcessorMetaSupplier;
import com.hazelcast.jet.core.ProcessorSupplier;
import com.hazelcast.jet.function.DistributedFunctions;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;
import com.hazelcast.jet.pipeline.WindowDefinition;
import org.apache.commons.lang.RandomStringUtils;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class JetDemo {

    public static void main( String[] args ) throws Exception {


        JetInstance jet = Jet.newJetClient();

        IMap<String, Long> counts = jet.getMap("testmap");

        Pipeline p = Pipeline.create();

        p.drawFrom(Sources.streamFromProcessor("test", TestProcessor.streamWords()))
                .addTimestamps()
                .window(WindowDefinition.sliding(60_000, 30_000))
                .groupingKey(DistributedFunctions.wholeItem())
                .aggregate(AggregateOperations.counting())
                .drainTo(Sinks.map("testmap"));



        try {
            //JobConfig jcg = new JobConfig();
            //jcg.setProcessingGuarantee()
            Job job = jet.newJob(p);
            job.join();
            counts.entrySet()
                    .stream().forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
            System.out.println(counts);
        } finally {
            jet.getCluster().shutdown();
        }

    }

    public static class TestProcessor extends AbstractProcessor {

        int total = 100000;
        int processed = 0;
        private Traverser<String> traverser;
        List<String> randomWords;

        public TestProcessor() {
            randomWords = Lists.newArrayListWithExpectedSize(20);
            for(int i = 0; i < 200; i++) {
                randomWords.add(RandomStringUtils.randomAlphabetic(10));
            }
        }

        @Override
        public boolean complete() {

            System.out.println("processed " + processed);

            if (processed == total) {
                return true;
            }

            if (traverser == null) {
                traverser = getWords();
                processed = processed + 1000;
            }

            if (emitFromTraverser(traverser)) {
                traverser = null;
            }

            return false;
        }

        @Override
        public boolean isCooperative() {
            return true;
        }

        private Traverser<String> getWords() {
            Random r = new Random();
            int low = 0;
            int high = 200;
            List<String> list = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                int index = r.nextInt(high - low) + low;
                list.add(randomWords.get(index));
            }


            return Traversers.traverseIterable(list);
        }

        public static ProcessorMetaSupplier streamWords() {
            return ProcessorMetaSupplier.forceTotalParallelismOne(ProcessorSupplier.of(() -> new TestProcessor()));
        }
    }


}
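As an aside on the `complete()` pattern above: a cooperative processor only gets a limited outbox quota per call, so `emitFromTraverser` can return `false` mid-batch, and the current traverser must be kept in a field and retried on the next call. Below is a toy, plain-Java simulation of that contract (these are illustrative stand-in classes, not Hazelcast Jet's real API; the per-call capacity is an assumed number for the demo):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

// Toy model of the complete()/emitFromTraverser contract used in TestProcessor:
// the "outbox" accepts only a limited number of items per cooperative call, so
// a 1000-item batch may need several calls to drain. NOT the real Jet API.
public class BatchDrainDemo {
    final Queue<String> emitted = new ArrayDeque<>();
    final int capacityPerCall;
    int usedThisCall;
    Iterator<String> traverser;   // kept across calls, like the `traverser` field
    String pendingItem;           // item the outbox rejected on the previous call

    BatchDrainDemo(int capacityPerCall) { this.capacityPerCall = capacityPerCall; }

    boolean offer(String item) {
        if (usedThisCall == capacityPerCall) return false; // outbox full
        usedThisCall++;
        emitted.add(item);
        return true;
    }

    // One cooperative call; returns true once the whole batch is emitted.
    boolean completeCall(List<String> batch) {
        usedThisCall = 0;                          // fresh quota each call
        if (traverser == null) traverser = batch.iterator();
        if (pendingItem != null) {
            if (!offer(pendingItem)) return false;
            pendingItem = null;
        }
        while (traverser.hasNext()) {
            String item = traverser.next();
            if (!offer(item)) {
                pendingItem = item;                // retry on the next call
                return false;
            }
        }
        traverser = null;                          // batch fully drained
        return true;
    }

    // Drains one batch, returning how many cooperative calls it took.
    static int callsToDrain(int batchSize, int capacityPerCall) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < batchSize; i++) batch.add("w" + i);
        BatchDrainDemo demo = new BatchDrainDemo(capacityPerCall);
        int calls = 0;
        do { calls++; } while (!demo.completeCall(batch));
        return calls;
    }

    public static void main(String[] args) {
        System.out.println("calls needed: " + callsToDrain(1000, 300));
    }
}
```

With a 1000-item batch and an assumed quota of 300 items per call, the batch drains over 4 calls, which is why `TestProcessor` must null out the traverser only after `emitFromTraverser` returns `true`.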

Thanks

sachin jain
  • You should submit the job using a _client_ (`Jet.newJetClient()`), not another Jet cluster member instance. Then the job will run on the already formed cluster. – Marko Topolnik Jul 28 '18 at 07:06
  • I tried it; it did not work. I don't think that is the cause, because I tried the same thing with this example: https://github.com/hazelcast/hazelcast-jet-code-samples/blob/0.6-maintenance/batch/wordcount/src/main/java/WordCount.java and it works there. The only difference is that WordCount is batch and my example is a stream. This is how I am creating the StreamSource: `ProcessorMetaSupplier.forceTotalParallelismOne(ProcessorSupplier.of(() -> new TestProcessor()))` Maybe it has to do with the way I am creating the StreamSource – sachin jain Jul 28 '18 at 12:47
  • So basically I took the word count problem on GitHub and did this: `p.drawFrom(Sources.map(BOOK_LINES)) .flatMap(e -> traverseArray(delimiter.split(e.getValue().toLowerCase()))) .filter(word -> !word.isEmpty()) .map(word -> { System.out.println("word is " + word); return word; }) etc... .drainTo(Sinks.map(COUNTS)); return p;` and did the same thing, and I could see my print statement on different nodes – sachin jain Jul 28 '18 at 12:53
  • @MarkoTopolnik I have copy pasted my code let me know if it helps – sachin jain Jul 28 '18 at 14:36
  • You have a source with total parallelism of one, and you print only from that processor. Naturally, there's only one instance of the processor in the entire cluster. Also note that you `join()` a job with an infinite stream as the source; that call will block forever. – Marko Topolnik Jul 28 '18 at 16:00
  • Hi Marko, thanks for your comments. I was following this example and trying to understand how it works. Is there any example or link you can point me to where I can use the same processor but distribute the work? https://github.com/hazelcast/hazelcast-jet/blob/master/hazelcast-jet-core/src/main/java/com/hazelcast/jet/impl/connector/ReadIListP.java – sachin jain Jul 28 '18 at 23:28
  • Oh, I think I know what is going on. Processors always run on the same node, and if we want to run the computation on multiple nodes, then the stages executed in the pipeline need to run on multiple nodes – sachin jain Jul 28 '18 at 23:45
  • No, actually the DAG edge going out of the source vertex should be distributed so it sends the data to all the other members. I think our pipeline planner doesn't take this into account properly, we'll look into it. – Marko Topolnik Jul 29 '18 at 07:38
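To illustrate Marko's last comment: even with a single-instance source, items can reach every member if the edge out of the source is distributed and partitioned, because each item is routed by its key's partition. Here is a plain-Java sketch of that routing idea (not the Hazelcast Jet API; the member-selection scheme is a simplification for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (NOT Hazelcast Jet API) of why a distributed, partitioned
// edge spreads work: each item is assigned a partition from its key's hash,
// and each partition is owned by some member, so a single-instance source can
// still feed every member downstream.
public class DistributedEdgeDemo {
    static final int PARTITION_COUNT = 271;   // Hazelcast's default partition count

    // Member index for a key; simplified stand-in for partition ownership.
    static int memberFor(String key, int memberCount) {
        int partition = Math.floorMod(key.hashCode(), PARTITION_COUNT);
        return partition % memberCount;
    }

    public static void main(String[] args) {
        int members = 2;
        List<String> words = List.of("alpha", "beta", "gamma", "delta", "epsilon", "zeta");
        Map<Integer, Integer> perMember = new HashMap<>();
        for (String w : words) {
            perMember.merge(memberFor(w, members), 1, Integer::sum);
        }
        System.out.println("items per member: " + perMember);
    }
}
```

The `System.out.println("processed " + ...)` in the question prints from inside the source processor itself, and `forceTotalParallelismOne` guarantees exactly one instance of that processor cluster-wide, so that line can only ever appear on one member regardless of how the downstream stages are distributed.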

0 Answers