This question may be a bit open-ended: I am trying to gather ideas on how to implement a BGP pipeline.
I am receiving 100-1000 messages (BGP updates) per second, a few kilobytes per update, over Kafka.
I need to archive them in a binary format with some metadata for fast lookup: I periodically build a "state" of the BGP table that merges all the updates received over a certain time window, hence the need for a database.
What I have been doing until now: grouping the updates into 5-minute files (messages concatenated end-to-end), as is common for BGP collection tools, and adding a link to each file in a database (a rough sketch is below). I see some disadvantages: it is complicated (having to group by key, manage Kafka offset commits) and there is no fine-grained control over where a state computation starts or ends.
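To make the current setup concrete, here is a minimal sketch of the windowed-file approach, assuming kafka-python, a hypothetical topic name `bgp-updates`, and ignoring the per-key grouping for brevity; offsets are committed only after a file is flushed, which is the bookkeeping I find error-prone:

```python
# Sketch only: batch Kafka messages into 5-minute binary files,
# committing offsets manually after each flush.
import time
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "bgp-updates",                  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="bgp-archiver",
    enable_auto_commit=False,       # commit manually, only after a file is written
)

WINDOW = 5 * 60                     # 5-minute files
buffer, window_start = [], time.time()

for msg in consumer:
    buffer.append(msg.value)        # raw BGP update bytes
    if time.time() - window_start >= WINDOW:
        # write all buffered updates end-to-end into one binary file
        path = f"updates-{int(window_start)}.bin"
        with open(path, "wb") as f:
            f.writelines(buffer)
        # only now is it safe to commit the Kafka offsets for this window
        consumer.commit()
        # ... insert `path` plus window metadata into the database here ...
        buffer, window_start = [], time.time()
```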
What I am thinking of instead: using a database (ClickHouse/Google Bigtable/Amazon Redshift) and inserting every single update with its metadata plus a link to the individual message stored on S3/Google Cloud Storage/a local file (sketched below).
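Something like the following, assuming ClickHouse via clickhouse-driver and S3 via boto3; the table layout, bucket name, and metadata columns are illustrative, not a fixed design:

```python
# Sketch only: one row per BGP update in ClickHouse, raw bytes in S3.
import uuid
import boto3                              # pip install boto3
from clickhouse_driver import Client      # pip install clickhouse-driver

ch = Client("localhost")
s3 = boto3.client("s3")
BUCKET = "bgp-archive"                    # hypothetical bucket

ch.execute("""
    CREATE TABLE IF NOT EXISTS bgp_updates (
        ts        DateTime,
        peer_asn  UInt32,
        prefix    String,
        s3_key    String
    ) ENGINE = MergeTree ORDER BY (ts, peer_asn)
""")

def archive_update(ts, peer_asn, prefix, raw_bytes):
    # ts: datetime.datetime of the update; raw_bytes: the binary BGP message.
    # Store the raw update as its own object, keep only metadata + link in the DB.
    key = f"updates/{ts:%Y/%m/%d}/{uuid.uuid4()}.bin"
    s3.put_object(Bucket=BUCKET, Key=key, Body=raw_bytes)
    ch.execute(
        "INSERT INTO bgp_updates (ts, peer_asn, prefix, s3_key) VALUES",
        [(ts, peer_asn, prefix, key)],
    )
```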
I am worried about download performance (most likely over HTTP), since compiling all the updates into a state may require fetching a few thousand of those messages (see the batch-download sketch below). Does anyone have experience batch-downloading at this scale? I also do not think storing the updates directly in the database would be optimal.
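For reference, this is roughly how I imagine the fetch step, assuming boto3 and a thread pool; the bucket name and key list are placeholders. My concern is whether the per-object HTTP overhead makes this noticeably slower than reading a handful of 5-minute files:

```python
# Sketch only: fetch a few thousand small S3 objects in parallel.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "bgp-archive"                    # hypothetical bucket

def fetch(key):
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

def fetch_all(keys, workers=32):
    # Many small GETs are latency-bound, so parallelism matters more than
    # bandwidth; pool.map preserves key order, keeping the state merge deterministic.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, keys))
```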
Any opinions, ideas, or suggestions? Thank you.