In my AWS architecture, I have a service running on an EC2 instance that calls the Twitter streaming API to ingest real-time tweets. I call this service TwitterClient.
The Twitter API uses a kind of long polling over HTTP to deliver streaming data. The documentation says a single connection is opened between your app (in my case, TwitterClient) and the API, with new tweets being sent through that connection.
TwitterClient then passes the real-time tweets to the backend (using Kinesis Data Streams) for processing.
The problem I am facing: running multiple EC2 instances in parallel results in duplicate tweets being ingested, so each tweet gets processed several times; but running only a single EC2 instance creates a single point of failure.
I cannot afford downtime as I can't miss a single tweet.
What should I do to ensure high availability?
Edit: Added a brief description of how Twitter API delivers streaming data