
I am currently using Kafka from Python. I was wondering whether Spark's Kafka integration is needed, or whether we can just use Kafka through pyKafka.

My concern is that Spark (PySpark) adds overhead to the process, and we don't use any Spark functions; only Kafka streaming is required.

What are the drawbacks of using PySpark and the Spark Kafka integration?

Ruslan Ostafiichuk
tensor
  • What is "Spark Kafka" or "Kafka Spark"? Note: *Kafka Streams* is a Java library. Kafka Python libraries don't have all the features of Spark Streaming or Kafka Streams – OneCricketeer Mar 06 '18 at 03:58
  • For example, which part is not available? – tensor Mar 06 '18 at 06:29
  • @Tensor: _What_ is it you're wanting to do? What are your requirements? From this, it's much easier to explain and position different technologies and their pros/cons for your requirements. Spark Streaming, Kafka Streams, and KSQL could all be options here, depending on what you want to do. – Robin Moffatt Mar 06 '18 at 11:25
  • Python is just a regular consumer/producer. Spark has dataframes and stuff and a bunch of libraries to integrate with external systems. Not that Python doesn't either, but you cannot scale a pure Python application as easily as Spark – OneCricketeer Mar 06 '18 at 14:35
  • PySpark has serialization overhead... – tensor Mar 06 '18 at 15:19

1 Answer


It depends entirely on the use case at hand, as mentioned in the comments. However, I faced the same situation a couple of months ago, so I will try to share what I learned and how I decided to move to kafka-streams instead of spark-streaming.

In my use case, we only used Spark for realtime streaming from Kafka, and did not do any map-reduce, windowing, filtering, or aggregation.
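For a pass-through use case like this, the plain-Python route the question asks about comes down to a consumer loop. The sketch below is only an illustration, not what my team ran: `consume`, `handle`, and `Record` are hypothetical names, and it assumes the client's consumer object can be iterated and yields messages exposing the payload as `.value`, which is how both pykafka and kafka-python consumers behave.

```python
from collections import namedtuple

# Stand-in for a Kafka record so the sketch runs without a broker.
# Assumption: real client messages also expose the payload as `.value`.
Record = namedtuple("Record", ["offset", "value"])


def consume(consumer, handle, max_messages=None):
    """Pass each message payload to `handle`; return how many were handled.

    `consumer` is any iterable of messages with a `.value` attribute.
    Messages that are None or carry no payload are skipped.
    """
    processed = 0
    for message in consumer:
        if message is None or message.value is None:
            continue
        handle(message.value)
        processed += 1
        # Optional bound so the loop can stop on an endless stream.
        if max_messages is not None and processed >= max_messages:
            break
    return processed


# Demo with fake records instead of a live topic.
received = []
count = consume([Record(0, b"first"), Record(1, b"second")], received.append)
```

With a real broker, the list of fake records would be replaced by a client consumer, for example `KafkaConsumer("my-topic", bootstrap_servers="localhost:9092")` from kafka-python or `topic.get_simple_consumer()` from pykafka; the topic name and address here are placeholders. Nothing in this loop needs Spark, but scaling it out and handling failures is then up to you.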

Given the above case, I did the comparison along 3 dimensions:

  1. Technicality
  2. DevOps
  3. Cost

The image below shows the comparison table I made to convince my team to migrate to kafka-streams and drop Spark. Cost is not included in the image, as it depends entirely on your cluster size (HeadNode-WorkerNodes).

Very important note: again, this depends on your case. I just tried to give you a pointer on how to do the comparison. Spark itself has lots of benefits, but describing them is beyond the scope of this question.

[Image: comparison table of spark-streaming vs. kafka-streams; not available]

Karim Tawfik