In my AWS architecture, I have a service running on an EC2 instance that calls the Twitter streaming API to ingest real-time tweets. I call this service TwitterClient.
The Twitter API uses a kind of long polling over HTTP to deliver streaming data. The documentation says a single connection is opened between your app (in my case, TwitterClient) and the API, with new tweets being sent through that connection.
TwitterClient then passes the real-time tweets to the backend (using Kinesis Data Streams) for processing.
The problem I am facing: running multiple EC2 instances in parallel results in duplicate tweets being ingested, so each tweet gets processed several times; but running only a single EC2 instance creates a single point of failure.
I cannot afford downtime as I can't miss a single tweet.
What should I do to ensure high availability?
Edit: Added a brief description of how Twitter API delivers streaming data