Using EC2 instances (along with Amazon Auto Scaling and Elastic Load Balancing) I have several instances of a TCP server running in Amazon Web Services. Each EC2 instance has access to a centralized database running on Amazon RDS. To keep the backend scalable, EC2 instances of the TCP server are added and removed depending on demand.
The servers are built with the Python Twisted framework. The system powers a custom instant messaging service, with multiple group chats that users can join.
When a user starts using the service, they establish a TCP connection to one of the TCP servers. Each server keeps in memory the currently connected users (i.e. the open TCP sockets) and which ‘group chat’ each user is currently ‘in’ (and thus subscribed to). All chat data created is stored in the database.
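To make the setup concrete, the in-memory state on each server looks roughly like this Twisted sketch (class and attribute names are illustrative, not my actual code):

    from twisted.internet import protocol, reactor

    class ChatProtocol(protocol.Protocol):
        def connectionMade(self):
            # Track the open TCP socket for this user.
            self.factory.connections.add(self)
            self.group = None

        def connectionLost(self, reason):
            self.factory.connections.discard(self)
            if self.group is not None:
                self.factory.groups[self.group].discard(self)

        def joinGroup(self, group_id):
            # Record which group chat this socket is subscribed to.
            self.group = group_id
            self.factory.groups.setdefault(group_id, set()).add(self)

    class ChatFactory(protocol.Factory):
        protocol = ChatProtocol

        def __init__(self):
            self.connections = set()   # all open sockets on this server
            self.groups = {}           # group_id -> set of ChatProtocol objects

    reactor.listenTCP(8000, ChatFactory())
    reactor.run()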
The Problem
When UserA posts a message in GroupChatZ, all users ‘in’ GroupChatZ should receive it. This is simple if there is only one TCP server: that server would search its memory for all users ‘in’ that group chat and send them the new message. However, since there is more than one server, whenever a new message is created the server handling it also needs to pass the message on to all the other servers (i.e. the other EC2 instances), so that they can deliver it to their own connected users.
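With a single server, the delivery step is just a loop over the in-memory sets shown above; something like this hypothetical helper:

    def broadcast_local(factory, group_id, message_bytes):
        # Deliver a new message to every socket on *this* server that is
        # subscribed to the group chat.
        for conn in factory.groups.get(group_id, set()):
            conn.transport.write(message_bytes)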
What is the most efficient solution to this problem? Ideally it would use AWS components.
One solution I can think of is for each server to store its IP address in the database when it first starts up, fetch the IP addresses of all the other running servers, and open a TCP connection to each of them. Whenever a new message is received, the server handling it would send it to all the other servers it is connected to.
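Roughly, that idea would look like the following (the port and all names are made up, and peer discovery details are omitted):

    from twisted.internet import protocol, reactor

    peers = set()  # open connections to the other servers

    class PeerProtocol(protocol.Protocol):
        def connectionMade(self):
            peers.add(self)

        def connectionLost(self, reason):
            peers.discard(self)

    class PeerFactory(protocol.ClientFactory):
        protocol = PeerProtocol

    def connect_to_peers(peer_ips):
        # peer_ips would be read from the database table that every
        # server writes its own IP address into on startup.
        for ip in peer_ips:
            reactor.connectTCP(ip, 9000, PeerFactory())

    def forward_to_peers(message_bytes):
        # Relay a newly created chat message to every other server,
        # which would then broadcast it to its own local sockets.
        for peer in peers:
            peer.transport.write(message_bytes)

(The sketch deliberately ignores reconnects and servers scaling in and out.)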
However, TCP connections are not 100% reliable, and this solution adds complexity.
I suspect there is actually a good way to use some Amazon Web Services component to implement a simple publish-subscribe mechanism (think of the Observer design pattern), i.e. one where, when one server publishes something, all the other servers receive it in real time.
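For example, something along these lines, sketched here with Redis pub/sub (which Amazon ElastiCache can host); the endpoint and channel names are hypothetical, and broadcast_local is the helper sketched earlier:

    import redis

    r = redis.Redis(host="my-elasticache-endpoint.example.com", port=6379)

    def publish_message(group_id, message_bytes):
        # Called by whichever server received the new chat message.
        r.publish("groupchat:%s" % group_id, message_bytes)

    def listen_forever(factory):
        # Every server runs this loop (in a thread, or via a Twisted Redis
        # client such as txredisapi) and relays incoming messages to its
        # own local sockets.
        pubsub = r.pubsub()
        pubsub.psubscribe("groupchat:*")
        for item in pubsub.listen():
            if item["type"] == "pmessage":
                group_id = item["channel"].decode().split(":", 1)[1]
                broadcast_local(factory, group_id, item["data"])

The appeal is that the servers would not need to know about each other at all; each one would only need to know the pub/sub endpoint.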