TCP communication on port not responding on different Availability Zone or Subnet

Question

I'm a little curious and confused about this situation. We set up a monitoring instance that scrapes an exposed endpoint on two different instances. Both are in the same VPC and use the same security group, route table, and ACL. Both instances also use the same AMI. For some reason, TCP communication on port 5001 doesn't work for the machine in subnet 10.0.1.0, but it works for the one in subnet 10.0.0.0. The working instance is in the same AZ as the monitoring machine (us-east-1a); the one that doesn't work is in us-east-1b.

After a lot of tcpdump and troubleshooting (other ports such as 80, 443, and 4001 work fine), I decided to create an AMI of the instance in 10.0.0.0 and deploy a new machine in the same subnet and AZ. Surprisingly, that worked. Now I have three machines: two in the same subnet sending metrics over 5001, and the other one timing out.
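
For anyone who wants to reproduce the symptom from the monitoring instance, here is a minimal sketch of a TCP probe in Python. The function name `probe_tcp` and the 3-second default timeout are my own choices for illustration, not part of the original setup. The point is to distinguish a timeout from a refused connection, since the two usually have different causes.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connect and classify the outcome.

    Returns "open", "timeout", "refused", or "error: <details>".
    """
    try:
        # create_connection performs the full three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No SYN-ACK and no RST came back before the deadline.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a RST: reachable, but nothing accepted the connection.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. "No route to host" or a name-resolution failure.
        return f"error: {exc}"
```

Running `probe_tcp(<private IP of each instance>, 5001)` next to the same call for port 80 or 443 shows whether only 5001 is affected. A "timeout" generally means packets are being dropped silently somewhere on the path or on the host, while "refused" means the host is reachable but nothing is accepting connections on that port.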

Is this something related to public IPs? An account limitation?

Thanks, I'm a little worried about this.

Edit:

I've done what Tim suggested in his comment. I created an AMI of the working instance in 10.0.0.0 and deployed it in 10.0.1.0, and it worked. So, just to be clear, the AMIs from both machines worked when swapped between the subnets. I'll call 10.0.0.0 subnetA and 10.0.1.0 subnetB. The AMI from B was deployed in A and it worked. The AMI from A was deployed in B and it worked as well. I'm a little confused.

BTW: Those machines were created by Terraform a long time ago, and we now use Pulumi. Maybe something happened during the terraform apply and no one noticed.

Odd, and interesting. The AWS setup you described sounds OK. First I would look at the VPC flow logs to see if packets are arriving at the machine and if a response is sent - it's fiddly but worthwhile. Next, try deploying the AMI of the 10.0.0.0 machine in the 10.0.1.0 subnet, and try deploying an AMI of the 10.0.1.0 machine in the other subnet. Please edit your question to include the results of both tests, then tag me in a comment if you would like me to consider what you found. — Tim, May 26 '22 at 19:19
Hey @Tim, thanks for the help on this. I ran some tests (see the Edit), and it looks like something happened to the machine during the deploy. — forgondolin, May 26 '22 at 20:53
Sounds like a glitch, perhaps a PEBKAC error (Google it if you're not familiar). I'm not sure it's worth the time to solve this problem. — Tim, May 26 '22 at 21:38
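
The flow-log check suggested in the comments can be sketched roughly as follows. This assumes the flow logs use the default (version 2) record format and have been exported as plain text lines; the names `FlowRecord`, `parse_flow_line`, and `records_for_port`, as well as the sample records, are made up for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional


class FlowRecord(NamedTuple):
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    action: str  # "ACCEPT" or "REJECT"


def parse_flow_line(line: str) -> Optional[FlowRecord]:
    """Parse one default-format flow log record.

    Returns None for malformed lines and for NODATA/SKIPDATA records.
    """
    parts = line.split()
    # Default field order: version account-id interface-id srcaddr dstaddr
    # srcport dstport protocol packets bytes start end action log-status
    if len(parts) != 14 or parts[13] != "OK":
        return None
    return FlowRecord(parts[3], parts[4], int(parts[5]), int(parts[6]), parts[12])


def records_for_port(lines: Iterable[str], port: int) -> List[FlowRecord]:
    """Keep records where the port is the destination (request) or the source (reply)."""
    parsed = (parse_flow_line(line) for line in lines)
    return [r for r in parsed if r is not None and port in (r.srcport, r.dstport)]


# Made-up sample records, for illustration only.
SAMPLE = [
    "2 123456789010 eni-0aaa 10.0.0.5 10.0.1.20 49152 5001 6 3 180 1653590000 1653590060 ACCEPT OK",
    "2 123456789010 eni-0aaa 10.0.0.5 10.0.1.20 49153 443 6 10 5200 1653590000 1653590060 ACCEPT OK",
    "2 123456789010 eni-0aaa - - - - - - - 1653590000 1653590060 - NODATA",
]

for record in records_for_port(SAMPLE, 5001):
    print(record)
```

As a rough reading guide: a REJECT on the request toward port 5001 points at a security group or network ACL, while an ACCEPT on the request with no record coming back from source port 5001 suggests the instance itself is not replying (host firewall, or the service not listening on that interface).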
