I set up a Hazelcast Jet cluster on AWS EC2 following the instructions here. I used the hazelcast-aws module so that the nodes can discover each other automatically. The cluster is up and running:

[2019-09-26 22:26:26.288] [INFO   ] com.hazelcast.config.AbstractConfigLocator - Using configuration file at /home/ec2-user/hazelcast-jet-3.1/config/hazelcast.xml
[2019-09-26 22:26:26.416] [INFO   ] com.hazelcast.instance.AddressPicker - [LOCAL] [jet] [3.1] Interfaces is enabled, trying to pick one address matching to one of: [172.31.*.*]
[2019-09-26 22:26:26.416] [INFO   ] com.hazelcast.instance.AddressPicker - [LOCAL] [jet] [3.1] Prefer IPv4 stack is true, prefer IPv6 addresses is false
[2019-09-26 22:26:26.425] [INFO   ] com.hazelcast.instance.AddressPicker - [LOCAL] [jet] [3.1] Picked [172.31.33.212]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true
[2019-09-26 22:26:26.460] [INFO   ] com.hazelcast.system - [172.31.33.212]:5701 [jet] [3.1] Hazelcast Jet 3.1 (20190624 - 000ced7) starting at [172.31.33.212]:5701

It also successfully found its peer:

[2019-09-26 22:26:26.664] [INFO   ] com.hazelcast.spi.impl.operationservice.impl.BackpressureRegulator - [172.31.33.212]:5701 [jet] [3.1] Backpressure is disabled
[2019-09-26 22:26:27.103] [INFO   ] com.hazelcast.instance.Node - [172.31.33.212]:5701 [jet] [3.1] Activating Discovery SPI Joiner
[2019-09-26 22:26:27.297] [INFO   ] com.hazelcast.jet.impl.metrics.JetMetricsService - [172.31.33.212]:5701 [jet] [3.1] Configuring metrics collection, collection interval=5 seconds, retention=5 seconds, publishers=[Management Center Publisher, JMX Publisher]
[2019-09-26 22:26:27.343] [INFO   ] com.hazelcast.jet.impl.JetService - [172.31.33.212]:5701 [jet] [3.1] Setting number of cooperative threads and default parallelism to 36
[2019-09-26 22:26:27.345] [INFO   ] com.hazelcast.spi.impl.operationexecutor.impl.OperationExecutorImpl - [172.31.33.212]:5701 [jet] [3.1] Starting 36 partition threads and 19 generic threads (1 dedicated for priority tasks)
[2019-09-26 22:26:27.354] [INFO   ] com.hazelcast.internal.diagnostics.Diagnostics - [172.31.33.212]:5701 [jet] [3.1] Diagnostics disabled. To enable add -Dhazelcast.diagnostics.enabled=true to the JVM arguments.
[2019-09-26 22:26:27.364] [INFO   ] com.hazelcast.core.LifecycleService - [172.31.33.212]:5701 [jet] [3.1] [172.31.33.212]:5701 is STARTING
[2019-09-26 22:26:27.772] [INFO   ] com.hazelcast.nio.tcp.TcpIpConnector - [172.31.33.212]:5701 [jet] [3.1] Connecting to /172.31.47.40:5701, timeout: 10000, bind-any: true
[2019-09-26 22:26:27.782] [INFO   ] com.hazelcast.nio.tcp.TcpIpConnection - [172.31.33.212]:5701 [jet] [3.1] Initialized new cluster connection between /172.31.33.212:47065 and /172.31.47.40:5701
[2019-09-26 22:26:33.786] [INFO   ] com.hazelcast.internal.cluster.ClusterService - [172.31.33.212]:5701 [jet] [3.1]

Members {size:2, ver:6} [
        Member [172.31.47.40]:5701 - 3ba123c0-e98b-47dc-9bf5-34944d2c53a2
        Member [172.31.33.212]:5701 - 0127e9a7-80b1-4c5d-a122-2da5aa7fa042 this
]

Everything looks good, except that my client (not on AWS) cannot connect to the cluster. All I am doing is running the word count example. The only difference is that, instead of running both client and server in the same JVM, I want to submit the job to the cluster I set up. Following the instructions, I replaced JetInstance jet = Jet.newJetInstance(); with:

        ClientConfig clientConfig = new ClientConfig();

        ClientNetworkConfig networkConfig = clientConfig.getNetworkConfig();
        clientConfig.getGroupConfig().setName("jet");
        networkConfig.getAwsConfig().setEnabled(true)
                .setProperty("access-key", "abc")
                .setProperty("secret-key", "cde")
                .setProperty("region", "us-west-2")
                .setProperty("security-group-name", "eee")
                .setProperty("hz-port", "5701")
                .setProperty("use-public-ip", "true");

        JetInstance jet = Jet.newJetClient(clientConfig);

I can tell the client is looking for the right endpoints:

INFO: hz.client_0 [jet] [3.0] [3.12] Trying to connect to cluster with name: jet
Sep 26, 2019 3:40:55 PM com.hazelcast.client.connection.nio.ClusterConnectorService
INFO: hz.client_0 [jet] [3.0] [3.12] Trying to connect to [172.31.47.40]:5701 as owner member
Sep 26, 2019 3:41:00 PM com.hazelcast.client.connection.nio.ClusterConnectorService
WARNING: hz.client_0 [jet] [3.0] [3.12] Exception during initial connection to [172.31.47.40]:5701: com.hazelcast.core.HazelcastException: java.net.SocketTimeoutException
Sep 26, 2019 3:41:00 PM com.hazelcast.client.connection.nio.ClusterConnectorService
INFO: hz.client_0 [jet] [3.0] [3.12] Trying to connect to [172.31.33.212]:5701 as owner member
Sep 26, 2019 3:41:05 PM com.hazelcast.client.connection.nio.ClusterConnectorService
WARNING: hz.client_0 [jet] [3.0] [3.12] Exception during initial connection to [172.31.33.212]:5701: com.hazelcast.core.HazelcastException: java.net.SocketTimeoutException

I already added port 5701 to the inbound rules of the security group used by the two EC2 instances. To debug, I ran a couple of networking commands to check whether port 5701 is open:

[ec2-user@ip-172-31-33-212 ~]$ sudo lsof -i -P -n | grep LISTEN
rpcbind   5428      rpc    8u  IPv4  50298      0t0  TCP *:111 (LISTEN)
rpcbind   5428      rpc   11u  IPv6  50301      0t0  TCP *:111 (LISTEN)
master    5897     root   13u  IPv4  40255      0t0  TCP 127.0.0.1:25 (LISTEN)
sshd      6115     root    3u  IPv4  41329      0t0  TCP *:22 (LISTEN)
sshd      6115     root    4u  IPv6  41331      0t0  TCP *:22 (LISTEN)
java     43020 ec2-user   10u  IPv6 118393      0t0  TCP *:5701 (LISTEN)
[ec2-user@ip-172-31-33-212 ~]$ sudo lsof -i:5701
COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    43020 ec2-user   10u  IPv6 118393      0t0  TCP *:5701 (LISTEN)
java    43020 ec2-user   45u  IPv6 152973      0t0  TCP ip-172-31-33-212.us-west-2.compute.internal:52599->ip-172-31-47-40.us-west-2.compute.internal:5701 (ESTABLISHED)

My knowledge of networking is limited, and I cannot figure out what the issue is. One thing I noticed is that the port is open for IPv6, while the client tries to connect to the private IPv4 address.

Z.SP
  • I repeated your setup and got in, so it does look like a networking issue at the AWS level. Can you try `nc 172.31.47.40 5701`? If the connection succeeds, you don't get your prompt back; after you type some text and press Enter, the prompt returns. This means your local machine successfully exchanged data with a Hazelcast Jet instance. – Marko Topolnik Sep 27 '19 at 11:28
  • In my security group's Inbound Rules I have this: `Custom TCP Rule | TCP | 5701 | 0.0.0.0/0` and another similar line for IPv6 (`::/0`). – Marko Topolnik Sep 27 '19 at 11:36

1 Answer

Marko was right (see the comments on the question). This looks like an AWS network constraint. I set up a netcat server on port 5701 on one of my EC2 boxes. I could not connect to that port from my laptop using nc, but I could from another EC2 instance in the same VPC. I then repeated the experiment with port 80 and could connect from both my laptop and EC2 instances in the same VPC. It looks like something allows hosts outside AWS to connect only to a few well-known ports on EC2 instances.
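For anyone who wants to repeat the reachability experiment without nc, here is a minimal Java sketch of the same check. The class name PortCheck, the helper canConnect, the 5-second timeout, and the default host are my own placeholders, not anything from Hazelcast; it only tests whether a TCP connection can be opened, not whether a Jet member is answering.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {

    /**
     * Returns true if a TCP connection to host:port can be opened
     * within timeoutMillis, false otherwise.
     */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers both timeouts (packets silently dropped, e.g. by a
            // firewall) and refusals (nothing listening on the port).
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical default; pass the address of your EC2 instance instead.
        String host = args.length > 0 ? args[0] : "localhost";
        int[] ports = {80, 5701};
        for (int port : ports) {
            System.out.println(host + ":" + port + " reachable: "
                    + canConnect(host, port, 5000));
        }
    }
}
```

Running it once from the laptop and once from another EC2 instance in the same VPC should reproduce the comparison described above, assuming something is listening on both ports.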

Anyway, I unblocked myself by running the Hazelcast server on port 80. This is not ideal, but it is much more convenient for trying out Hazelcast Jet features from my IDE than deploying test code to EC2.

Z.SP