
I am using Spark–Kafka integration for a project that finds the top trending hashtags on Twitter. Kafka receives tweets pushed by a tweepy streaming producer, and on the consumer side Spark Streaming performs the DStream and RDD transformations.
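For reference, the per-batch logic I apply to each micro-batch boils down to something like this plain-Python sketch (function names and the sample batch are illustrative, not my actual Spark code):

```python
import re
from collections import Counter

def extract_hashtags(tweet_text):
    """Return all hashtags in a tweet, lower-cased so counts aggregate."""
    return [tag.lower() for tag in re.findall(r"#\w+", tweet_text)]

def top_hashtags(tweets, n=3):
    """Count hashtags across a batch of tweets; return the n most common."""
    counts = Counter()
    for text in tweets:
        counts.update(extract_hashtags(text))
    return counts.most_common(n)

batch = [
    "Loving #Spark and #Kafka!",
    "#spark streaming is neat",
    "Nothing trending here",
]
print(top_hashtags(batch, 2))  # [('#spark', 2), ('#kafka', 1)]
```

In the real job the same extraction runs inside a `flatMap` over the DStream, followed by a windowed count.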

My question is whether running the streaming pipeline through Kafka for some time could lead to storage issues, since I am running both the producer and the consumer on my local machine. How long can I safely run the producer? (I need it to run for a while to get accurate trending counts.)

Also, would it be better to run this on a cloud platform such as AWS?

Dharmesh Singh

2 Answers


I agree; storage is the usual dilemma when running a streaming server. AWS has Amazon MSK, a managed Kafka streaming service. Its strong point is that you can integrate S3 for backups, which costs much less than local storage and adds durability; EBS storage can also be provisioned on the fly.

https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-kafka-on-aws/

AWS PS
  • Scalability - what about that? – thebluephantom Jan 05 '20 at 18:25
  • 1
    Running your Kafka deployment on Amazon EC2 provides a high performance, scalable solution for ingesting streaming data. AWS offers many different instance types and storage option combinations for Kafka deployments. – AWS PS Jan 05 '20 at 18:32
  • Indeed, that's what I mean, but you can add that to your answer. E.g. number of KAFKA partitions able to be processed. – thebluephantom Jan 05 '20 at 18:35
  • "S3 for backups" - please explain how that works in context of Kafka / Zookeeper. AFAIK, MSK does not offer such a solution – OneCricketeer Jan 05 '20 at 21:40
  • you need Kafka s3 connector https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/ – AWS PS Jan 05 '20 at 21:53
  • That's not intended for backups. There is no source connector to read that data back, and if you did, you lose partition, timestamps, and record key information. Plus Amazon cannot provide S3 Connector without license violations. By the way, that blog was written before MSK existed – OneCricketeer Jan 06 '20 at 00:54

It's not clear what time window you're using or where Kafka is running. Calculating trends over a 10-minute or one-hour window shouldn't take up much disk at all on the Spark cluster.

Kafka storage will, of course, need to be large enough for your use case, but note that the broker deletes old log segments on its own: retention is bounded by time and/or size settings (the default time-based retention is 7 days), so you do not have to delete the topic manually.
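These are the broker/topic retention knobs involved (real Kafka config keys; the values shown are illustrative, not recommendations):

```properties
# server.properties (or per-topic overrides via retention.ms / retention.bytes)
log.retention.hours=24           # delete log segments older than 24 hours
log.retention.bytes=1073741824   # cap each partition's log at ~1 GiB
log.segment.bytes=268435456      # roll segments at 256 MiB so old ones can be deleted
```

With limits like these, a local broker cannot grow unboundedly even if the producer runs for a long time.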

Tweets are not very large. Filtering out hashtags only makes them smaller.

Note: Spark seems like overkill for this; you could do the same with Kafka Connect for ingestion and ksqlDB for the computation.

OneCricketeer
  • To get trending topics nearly the same as Twitter's, I need to run the Kafka Twitter stream for at least an hour on my local machine, so could that cause any storage issues? Also, are all the streamed messages stored on the Kafka broker or on the Spark cluster? Do they remain as long as I keep the Kafka server running, or do I have to delete the topic manually? @cricket_007 – Dharmesh Singh Jan 06 '20 at 07:03
  • 1
    It's not clear how many message per second you're getting. If you compute the average size of 100 messages, then do some math for an hour of messages, and then you get larger than any single hdd you have, then you'll have issues on any system – OneCricketeer Jan 06 '20 at 08:01
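The back-of-the-envelope math in that comment can be sketched as a small helper (the rate, message size, and 10% overhead factor below are illustrative assumptions, not measured values):

```python
def disk_needed_bytes(avg_msg_bytes, msgs_per_sec, hours,
                      replication=1, overhead=1.1):
    """Rough upper bound on Kafka log size for a sustained ingest rate.

    overhead is a fudge factor for record framing and index files.
    """
    return int(avg_msg_bytes * msgs_per_sec * 3600 * hours
               * replication * overhead)

# e.g. 2 KB tweets at 50 msg/s for one hour on a single local broker:
gib = disk_needed_bytes(2048, 50, 1) / 2**30
print(f"{gib:.2f} GiB")  # → 0.38 GiB: nowhere near filling a local disk
```

Even at rates well above a filtered Twitter stream, an hour of tweets stays far below typical local disk capacity.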