How to design a server monitoring system running on AWS

Question

I am building a monitoring agent application that runs on AWS EC2 machines. I need to be able to send commands to the agent running on a specific EC2 instance, and only the agent running on that instance should pick them up and act on them. New EC2 instances can come and go at any point in time. I could use Kinesis, push all commands for all instances there, and let each agent pick up the ones targeted at it. The problem with this is that agents would receive a lot of commands that are not for them and would have to filter them out.

I could also use one SQS queue per instance, but this would require creating/deleting a queue every time an instance is provisioned or removed.

I would like to hear whether there are already proven solutions for a similar scenario.

If you don't want to use SSM's Run Command, then SQS is probably the way to go. I don't see much of an issue with creating/deleting a queue every time an instance is provisioned. Creation can easily be done through UserData when the instance is launched, and deletion through CW Events on termination. – Marcin, Nov 23 '20 at 00:02
The problem is that whenever something goes wrong and the agent or the EC2 instance shuts down, the SQS queue will be left hanging there. Technically it is possible to implement a clean-up procedure, but I am looking for a simpler solution. – Tamerlane, Nov 27 '20 at 06:01
@Tamerlane, any more questions? If the answer is helpful, you can accept it. – J00MZ, Nov 23 '21 at 17:57

Dennis Traub · Answer 1 · 2020-11-18T22:57:36.663

There already is a fully functional feature provided by AWS. I would rather use that one as opposed to reinventing the wheel, as it is a robust, well-integrated, and proven solution that’s being leveraged by thousands of AWS customers to gain operational insights into their instance fleets:

AWS Systems Manager Agent (SSM Agent) is a piece of software that can be installed and configured on an EC2 instance (and it’s pre-installed on many of the default AMIs, including both versions of Amazon Linux, Ubuntu, and various versions of Windows Server). SSM Agent makes it possible to update, manage, and configure these resources. The agent processes requests from the Systems Manager service in the AWS Cloud, and then runs them as specified in the request. SSM Agent then sends status and execution information back to the Systems Manager service by using the Amazon Message Delivery Service.

You can learn more about AWS Systems Manager and the breadth and depth of functionality it provides in the AWS Systems Manager documentation.
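As a rough illustration of how a command can be targeted at a single instance through Systems Manager Run Command, here is a minimal sketch. It only builds the arguments for the SSM `SendCommand` API call; the instance ID and the `uptime` command are placeholders, and `build_send_command_request` is a hypothetical helper name, not part of any AWS SDK.

```python
def build_send_command_request(instance_id, shell_commands):
    """Build keyword arguments for the SSM SendCommand API call."""
    if not shell_commands:
        raise ValueError("at least one shell command is required")
    return {
        # Target exactly one instance, so only its SSM Agent runs the command.
        "InstanceIds": [instance_id],
        # AWS-managed document that runs shell commands on Linux instances.
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {"commands": list(shell_commands)},
    }


# Placeholder instance ID and command, for illustration only.
request = build_send_command_request("i-0123456789abcdef0", ["uptime"])

# Assuming boto3 is installed and credentials are configured,
# the request could then be sent like this:
#   import boto3
#   boto3.client("ssm").send_command(**request)
```

Whether this fits depends on whether your commands can be expressed as something the SSM Agent can run on the instance.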

Given the nature of the agent, its business logic, etc., we have to develop it ourselves, so SSM Agent won't work. – Tamerlane, Nov 19 '20 at 00:32

score 0 · Answer 2 · answered Nov 18 '20 at 22:48

Have you considered using Simple Notification Service (SNS)? Each new EC2 instance could subscribe to a topic, e.g. over HTTP, and remove the previous subscribers.

That way the topic would stay constant regardless of EC2 rotation.

It might be worth noting that SNS supports subscription filter policies, so it can decide which messages to deliver to which endpoint.
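To make the filtering idea concrete, here is a minimal sketch of the arguments for the SNS `Subscribe` and `Publish` API calls when each subscriber should only receive messages addressed to its own instance. The message attribute name `instance_id`, the helper names, and the ARN and endpoint values are all assumptions made for illustration.

```python
import json


def build_subscribe_request(topic_arn, protocol, endpoint, instance_id):
    """Arguments for SNS Subscribe with a filter policy for one instance."""
    # The policy matches messages whose "instance_id" message attribute
    # equals this instance's ID; other messages are not delivered.
    filter_policy = {"instance_id": [instance_id]}
    return {
        "TopicArn": topic_arn,
        "Protocol": protocol,
        "Endpoint": endpoint,
        "Attributes": {"FilterPolicy": json.dumps(filter_policy)},
    }


def build_publish_request(topic_arn, instance_id, command):
    """Arguments for SNS Publish, addressed to one instance via an attribute."""
    return {
        "TopicArn": topic_arn,
        "Message": command,
        "MessageAttributes": {
            "instance_id": {"DataType": "String", "StringValue": instance_id}
        },
    }


# Placeholder values, for illustration only.
TOPIC = "arn:aws:sns:us-east-1:123456789012:agent-commands"
subscribe_request = build_subscribe_request(
    TOPIC, "https", "https://agent.example.com/commands", "i-0123456789abcdef0"
)
publish_request = build_publish_request(TOPIC, "i-0123456789abcdef0", "restart")

# Assuming boto3 is installed and credentials are configured:
#   import boto3
#   sns = boto3.client("sns")
#   sns.subscribe(**subscribe_request)
#   sns.publish(**publish_request)
```

With this setup the topic stays constant, and the filtering happens on the SNS side rather than in the agent.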

Tomek Klas

Yes, SNS is one of the options I am considering. But removing subscriptions is not an option, as I can receive messages to be distributed to different instances at any point. What I could do, though, is add/remove an SQS subscription to an SNS topic per agent. That seems too convoluted, though, because if the agent crashes the queue will be left hanging there. – Tamerlane Nov 19 '20 at 00:40
Hmm, perhaps you could use Route53 to create a unique DNS name, assign it to the EC2 instance, and subscribe using that address. Then reassigning that domain name to the new machine at launch would work. – Tomek Klas Nov 19 '20 at 08:19

score 0 · Answer 3 · answered Nov 26 '20 at 22:44

From what I have seen, AWS SWF could be an option here, since Amazon SWF is designed to coordinate work across distributed application components and provides SDKs for various platforms. Refer to the official FAQs for a more in-depth understanding: https://aws.amazon.com/swf/faqs/

J00MZ · Answer 4 · 2020-12-01T14:33:20.423

It is not entirely clear what the volume of the monitoring system messages will be.

But the architecture requirements described sound to me as follows:

The agents on the EC2 instances are (constantly?) polling some centralized service, which is a poll-based architecture.
The messages being sent are addressed to a specific, predetermined EC2 instance, which is a push-based architecture.

To support both without significant filtering of the messages, I suggest you try an intermediate pub/sub system such as Kafka, which is available as a managed service on AWS through MSK.

Then, to differentiate between the instances, create a Kafka topic named after each EC2 instance ID.

This gives every instance a unique topic, denoted by its own instance ID, from which it can easily read the messages meant for itself.

Producers can likewise push messages to a specific EC2 instance by sending them to the topic in the cluster named after that instance's ID.
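A minimal sketch of the topic-per-instance naming could look like the following. The helper `topic_for_instance` and the optional prefix are hypothetical; the ID check assumes the usual `i-` plus 8 or 17 hexadecimal characters format, which also fits within Kafka's rules for topic names (letters, digits, `.`, `_`, `-`).

```python
import re

# EC2 instance IDs: "i-" followed by 8 (older) or 17 (newer) hex characters.
_INSTANCE_ID = re.compile(r"^i-(?:[0-9a-f]{8}|[0-9a-f]{17})$")


def topic_for_instance(instance_id, prefix=""):
    """Return the Kafka topic name that carries commands for one instance."""
    if not _INSTANCE_ID.match(instance_id):
        raise ValueError(f"not a valid EC2 instance ID: {instance_id!r}")
    # The optional prefix (an assumption, not part of the answer) makes it
    # easier to tell agent topics apart from other topics in the cluster.
    return f"{prefix}{instance_id}"


topic = topic_for_instance("i-0123456789abcdef0")

# Assuming the kafka-python package and a reachable cluster, the agent and
# the sender could then use the same name on both sides:
#   from kafka import KafkaConsumer, KafkaProducer
#   consumer = KafkaConsumer(topic, bootstrap_servers="broker:9092")
#   KafkaProducer(bootstrap_servers="broker:9092").send(topic, b"restart")
```

The agent would have to learn its own instance ID first, for example from the instance metadata service.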

Since there are many EC2 instances coming and going, you will end up with many topics. To keep the number of topics under control, you can have each EC2 termination event reported to CloudWatch, and then check CloudWatch to see which EC2 instances were terminated and therefore which topics need deleting.

Alternatively, you can trigger a Lambda directly on the EC2 termination event and log it by writing a file named after the instance ID to an S3 bucket. A second Lambda watching that bucket can then delete the topics of old EC2 instances from the Kafka cluster when their instance IDs appear there.
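As a rough sketch of the clean-up side, a Lambda handler for the EC2 state-change event could look like this. It is a simplification of the flow described above: it skips the S3 marker file and calls a clean-up function directly. `make_handler` and the `delete_topic` callback are hypothetical names; the callback stands in for whatever actually removes the topic from the cluster.

```python
def terminated_instance_id(event):
    """Return the instance ID if the event reports a termination, else None."""
    if event.get("source") != "aws.ec2":
        return None
    detail = event.get("detail", {})
    if detail.get("state") != "terminated":
        return None
    return detail.get("instance-id")


def make_handler(delete_topic):
    """Build a Lambda handler that calls delete_topic(instance_id)."""

    def handler(event, context):
        instance_id = terminated_instance_id(event)
        if instance_id is None:
            # Not a termination event, so there is nothing to clean up.
            return {"deleted": None}
        delete_topic(instance_id)
        return {"deleted": instance_id}

    return handler


# Trimmed-down example of an "EC2 Instance State-change Notification" event.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "terminated"},
}

deleted_topics = []  # stand-in for the real topic deletion
handler = make_handler(deleted_topics.append)
result = handler(sample_event, None)
```

Passing the deletion step in as a callback keeps the event handling testable without a Kafka cluster.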
