hazelcast-jet deployment and data ingestion

Question

I have a distributed system running on AWS EC2 instances. My cluster has around 2000 nodes. I want to introduce a stream processing model that can process the metadata each node publishes periodically (CPU usage, memory usage, I/O, etc.). My system only cares about the latest data, and it is OK with missing a couple of data points while the processing model is down. I therefore picked hazelcast-jet, an in-memory processing model with great performance. I have a few questions about it:

What is the best way to deploy hazelcast-jet to multiple EC2 instances?
How do I ingest data from thousands of sources? The sources push data rather than being pulled.
How do I configure the client so that it knows where to submit the tasks?

It would be super useful if there were a comprehensive example that I could learn from.

score 2 · Accepted Answer · edited Sep 26 '19 at 09:03

What is the best way to deploy hazelcast-jet to multiple EC2 instances?

Download and unzip the Hazelcast Jet distribution on each machine:

$ wget https://download.hazelcast.com/jet/hazelcast-jet-3.1.zip
$ unzip hazelcast-jet-3.1.zip
$ cd hazelcast-jet-3.1

Go to the lib directory of the unzipped distribution and download the hazelcast-aws module:

$ cd lib
$ wget https://repo1.maven.org/maven2/com/hazelcast/hazelcast-aws/2.4/hazelcast-aws-2.4.jar

Edit bin/common.sh to add the module to the classpath. Towards the end of the file there is this line:
```
CLASSPATH="$JET_HOME/lib/hazelcast-jet-3.1.jar:$CLASSPATH"
```
Duplicate this line and, in the copy, replace `-jet-3.1` with `-aws-2.4` so that `hazelcast-aws-2.4.jar` is on the classpath as well.
Edit config/hazelcast.xml to enable AWS cluster discovery. The details are here. In this step you'll have to deal with IAM roles, EC2 security groups, regions, etc. There's also a best practices guide for AWS deployment.
Start the cluster with jet-start.sh.
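
To check that the classpath edit took effect, something like the following could be run with the same classpath that bin/common.sh builds. This is only a sketch: `ClasspathCheck` is a hypothetical helper and not part of Jet, and the default class name is the one that appears in the error message quoted in the comments below.

```java
// Hypothetical helper, not part of Hazelcast Jet.
class ClasspathCheck {
    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the ValidationException message; pass another name to override.
        String name = args.length > 0 ? args[0] : "com.hazelcast.aws.AwsDiscoveryStrategy";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found on classpath"));
    }
}
```

The same check applies on the client side, since the client needs the module too if it uses AWS discovery.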

How do I configure the client so that it knows where to submit the tasks?

A straightforward approach is to specify the public IPs of the machines where Jet is running, for example:

ClientConfig clientConfig = new ClientConfig();
clientConfig.getGroupConfig().setName("jet");
clientConfig.addAddress("54.224.63.209", "34.239.139.244");

However, depending on your AWS setup, these IPs may not be stable, so you can configure the client to discover the members as well. This is explained here.
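
Before handing a list of addresses to the client, it can help to check which of them actually accept a TCP connection from the client's machine. Below is a minimal sketch using only the JDK. `MemberProbe` is a hypothetical helper, not a Hazelcast class, and the port is whatever your members listen on (5701 in the setup discussed in the comments).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hazelcast Jet.
class MemberProbe {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or host could not be resolved
        }
    }

    /** Keeps only the hosts that accept a TCP connection on the given port. */
    static List<String> reachableHosts(List<String> hosts, int port, int timeoutMillis) {
        List<String> result = new ArrayList<>();
        for (String host : hosts) {
            if (isReachable(host, port, timeoutMillis)) {
                result.add(host);
            }
        }
        return result;
    }
}
```

A `true` result only shows that the port is open from where the check runs. It does not prove that a Jet member is listening there or that the client will be able to join the cluster.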

How do I ingest data from thousands of sources? The sources push data rather than being pulled.

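I think your best option is to have the nodes put their data into a Hazelcast Map (`IMap`) and to use a `mapJournal` source to get the update events from it. The event journal has to be enabled for that map in the cluster configuration for this source to work.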
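
To make the idea concrete, here is a plain-Java model of the pattern. It is not Hazelcast code: `LatestMetricsStore` and all its members are hypothetical names, and the journal is modelled as a fixed-size ring buffer, which is an assumption of this sketch. Each node overwrites its own entry, so the map always holds the latest reading per node, while a consumer reads the stream of update events from the journal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Plain-Java model of "map + journal of updates"; NOT the Hazelcast API.
class LatestMetricsStore {
    /** One update event: its position in the journal, the writer, and the value written. */
    static final class Event {
        final long sequence;
        final String nodeId;
        final double cpuUsage;

        Event(long sequence, String nodeId, double cpuUsage) {
            this.sequence = sequence;
            this.nodeId = nodeId;
            this.cpuUsage = cpuUsage;
        }
    }

    private final Map<String, Double> latest = new ConcurrentHashMap<>();
    private final Event[] ring;        // fixed capacity: the oldest events get overwritten
    private long nextSequence = 0;

    LatestMetricsStore(int journalCapacity) {
        ring = new Event[journalCapacity];
    }

    /** Called by a node to push a reading; a newer reading replaces the older one. */
    synchronized void put(String nodeId, double cpuUsage) {
        latest.put(nodeId, cpuUsage);
        ring[(int) (nextSequence % ring.length)] = new Event(nextSequence, nodeId, cpuUsage);
        nextSequence++;
    }

    /** Latest reading for a node, or null if the node has not reported yet. */
    Double latestFor(String nodeId) {
        return latest.get(nodeId);
    }

    /** Returns the events with sequence >= fromSequence that the journal still retains. */
    synchronized List<Event> readFrom(long fromSequence) {
        long oldest = Math.max(0, nextSequence - ring.length);
        List<Event> out = new ArrayList<>();
        for (long s = Math.max(fromSequence, oldest); s < nextSequence; s++) {
            out.add(ring[(int) (s % ring.length)]);
        }
        return out;
    }
}
```

In this model a slow or stopped consumer simply misses the events that were overwritten, which matches the requirement in the question that losing a few data points is acceptable.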

I am struggling to get my client to connect to the cluster. My client is outside of AWS. I tried the straightforward approach above, but the connection always times out (I added 5701 to the inbound rule of the security group for all destinations). — Z.SP, Sep 26 '19 at 08:24
I also tried the auto-discovery approach. I followed the link you posted but got the following error: `Caused by: com.hazelcast.config.properties.ValidationException: There is no discovery strategy factory to create 'DiscoveryStrategyConfig{properties={secret-key=*****, security-group-name=****, access-key=****, hz-port=5701, region=us-west-2}, className='com.hazelcast.aws.AwsDiscoveryStrategy', discoveryStrategyFactory=null}' Is it a typo in a strategy classname? Perhaps you forgot to include implementation on a classpath?` — Z.SP, Sep 26 '19 at 08:25
You get this exception when the `hazelcast-aws` jar is not on your classpath. It has to be there for the client as well, if AWS discovery is to work. — Marko Topolnik, Sep 26 '19 at 08:59
Adding `hazelcast-aws` solved the class-not-found exception. However, I still cannot connect to the cluster, even after updating my security group's inbound rule to allow all traffic from everywhere. I will create a separate question with more detailed info on the connection part. Accepting this answer since it helped me set up the cluster successfully. — Z.SP, Sep 26 '19 at 20:36
