Questions tagged [spark-streaming-kafka]

Spark Streaming integration for Kafka. Direct Stream approach provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.

250 questions
1
vote
0 answers

Spark Streaming Kafka consumer auto close

I don't want to use one consumer for all topics; I want to use this method to improve consumption efficiency: val kafkaParams = Map( ConsumerConfig.GROUP_ID_CONFIG -> group, ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> brokers, …
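A minimal sketch of the per-topic direct-stream setup the excerpt is reaching for, assuming the spark-streaming-kafka-0-10 integration; broker, group, and topic names are placeholders:

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("per-topic-consumers")
val ssc  = new StreamingContext(conf, Seconds(10))

// Placeholder connection settings (the excerpt's `group` and `brokers` values).
val kafkaParams = Map[String, Object](
  ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG        -> "broker1:9092",
  ConsumerConfig.GROUP_ID_CONFIG                 -> "my-group",
  ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG   -> classOf[StringDeserializer],
  ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
)

// One direct stream per topic rather than a single Subscribe over all topics;
// each stream is then scheduled and consumed independently.
val streams = Seq("topicA", "topicB").map { topic =>
  KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams)
  )
}
streams.foreach(_.map(_.value).print())

ssc.start()
ssc.awaitTermination()
```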
1
vote
0 answers

kafka stream with AWS Glue - authorization issue

I want to stream a Kafka topic from a Glue job, but I get the following error: StreamingQueryException: Not authorized to access topics: [topic_name] This is my current script: # Script generated for node Kafka Stream dataframe_KafkaStream_node1 =…
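This error usually comes from the Kafka broker's ACLs rather than from Glue itself. A hedged Structured Streaming sketch, assuming SASL/SCRAM authentication; all hostnames and credentials are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("glue-kafka-sasl").getOrCreate()

// Kafka client security settings must be passed with the "kafka." prefix;
// the mechanism and credentials below are placeholders for the real cluster's.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9096")
  .option("subscribe", "topic_name")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "SCRAM-SHA-512")
  .option("kafka.sasl.jaas.config",
    """org.apache.kafka.common.security.scram.ScramLoginModule required
      |username="user" password="secret";""".stripMargin)
  .load()
```

If the client settings are already correct, the fix is usually broker-side: an ACL granting Read on both the topic and the consumer group.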
1
vote
0 answers

pyspark ml training ALS: No ratings available from MapPartitionsRDD

I'm trying to train ALS with data from each Kafka batch using Spark Streaming, and I'm facing the error below. I think it's because the rating column is negative or otherwise invalid (such as a wrong data type), so I filtered it and cast it to double, but…
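A sketch of the sanitizing step the asker describes, with hypothetical column names; ALS raises "No ratings available" when the batch ends up empty after filtering, so it is worth guarding for that case:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("als-per-batch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for one parsed micro-batch from Kafka.
val raw = Seq((1, 10, "4.0"), (2, 11, ""), (3, 12, "5.0"))
  .toDF("userId", "itemId", "rating")

val ratings = raw
  .withColumn("rating", $"rating".cast("double"))  // bad strings become null
  .filter($"rating".isNotNull && $"rating" >= 0)

// ALS fails with "No ratings available" on empty input, which can happen
// when every row of a micro-batch is filtered out; guard for it.
if (!ratings.isEmpty) {
  val model = new ALS()
    .setUserCol("userId")
    .setItemCol("itemId")
    .setRatingCol("rating")
    .fit(ratings)
}
```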
1
vote
1 answer

Spark speculative tasks and its performance overhead

I am currently exploring Spark's speculative tasks option. Below is the configuration I am planning to use. I am reading the data from Kafka, and using repartition() I create around 200+ tasks in my streaming code. …
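For reference, the standard speculation knobs look like this; the values below are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculative-streaming")
  .config("spark.speculation", "true")
  .config("spark.speculation.interval", "100ms")  // how often to check for stragglers
  .config("spark.speculation.multiplier", "1.5")  // straggler = 1.5x the median task time
  .config("spark.speculation.quantile", "0.9")    // only after 90% of tasks in a stage finish
  .getOrCreate()
```

With 200+ short streaming tasks, the quantile and multiplier dominate the overhead: settings that are too aggressive relaunch tasks that would have finished anyway.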
1
vote
1 answer

KafkaUtils.createDirectStream gives an error

I changed a line from createStream to createDirectStream, since the new library does not support createStream. I checked it from here: https://codewithgowtham.blogspot.com/2022/02/spark-streaming-kafka-cassandra-end-to.html scala> val lines =…
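In the 0-10 integration, the direct-stream call takes a location strategy and a consumer strategy. A spark-shell style sketch, assuming spark-streaming-kafka-0-10 is on the classpath; broker, group, and topic values are placeholders:

```scala
// The receiver-based createStream belongs to the retired 0-8 integration.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))   // sc is the shell's SparkContext

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "group.id"           -> "demo",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer]
)

val lines = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("test"), kafkaParams)
).map(_.value)
```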
1
vote
0 answers

Spark task failure with ClassCastException [C cannot be cast to [J, at org.apache.spark.unsafe.memory.HeapMemoryAllocator.allocate

We have a Java Spark Streaming application which does an SCD2 operation on Delta Lake. We were using Spark 3.0.0 and Delta Lake 0.7.0; after upgrading to Spark 3.2.0 and Delta 1.1.0, we see the following exception (under a load of 100K events): Caused…
1
vote
1 answer

Apache Spark with kafka stream - Missing Kafka

I am trying to set up Apache Spark with Kafka. I wrote a simple program locally, and it is failing; I am not able to figure it out from debugging. build.gradle.kts implementation ("org.jetbrains.kotlin:kotlin-stdlib:1.4.0") implementation…
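A "missing Kafka" failure in a local Structured Streaming program usually means the spark-sql-kafka integration jar is absent. A smoke-test sketch; the artifact coordinate in the comment is an assumption to be matched to the actual Spark and Scala versions:

```scala
// A "Failed to find data source: kafka" style error means the integration jar
// is missing. Assumed coordinate (match it to your Spark/Scala versions):
//   org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-smoke-test")
  .master("local[*]")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test")
  .load()

df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()
```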
1
vote
0 answers

Write Spark Stream to Phoenix table

I am trying to figure out how to write a Spark Stream to a Phoenix table in the least convoluted way. So far I have only found this solution: kafka-to-phoenix, which requires some deep ad-hoc engineering (to my noob eyes). I can tailor the linked…
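One less convoluted route is to reuse the batch Phoenix connector inside foreachBatch, since Structured Streaming has no native Phoenix sink. A hedged sketch; the format name and option keys follow the phoenix-spark connector and should be checked against the Phoenix version in use, and all table and host values are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("stream-to-phoenix").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

stream
  .selectExpr("CAST(key AS STRING) AS ID", "CAST(value AS STRING) AS PAYLOAD")
  .writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // Batch Phoenix connector reused per micro-batch; the format name and
    // options ("phoenix" / "table" / "zkUrl") vary by connector version.
    batch.write
      .format("phoenix")
      .option("table", "MY_TABLE")       // placeholder target table
      .option("zkUrl", "zkhost:2181")    // placeholder ZooKeeper quorum
      .mode("append")
      .save()
  }
  .start()
  .awaitTermination()
```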
1
vote
1 answer

foreach() method with Spark Streaming errors

I'm trying to write data pulled from Kafka to a BigQuery table every 120 seconds. I would like to do some additional operations which, per the documentation, should be possible inside the foreach() or foreachBatch() method. As a test I wanted to print a…
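A sketch of that test with placeholder topic and server values; the distinction between the two methods is noted in the comments:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("foreachbatch-test").getOrCreate()

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
  .option("subscribe", "events")                        // placeholder
  .load()

kafkaDf.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // foreachBatch runs on the driver with an ordinary batch DataFrame,
    // so show()/count() print where you expect. Per-row foreach() instead
    // runs on executors, and its println output lands in executor logs.
    println(s"batch $batchId: ${batch.count()} rows")
    batch.show(5, truncate = false)
  }
  .start()
  .awaitTermination()
```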
1
vote
1 answer

Spark and Kafka: how to increase parallelism for a producer sending a large batch of records, improving network usage?

I am trying to understand how I can send (produce) a large batch of records to a Kafka topic from Spark. From the docs I can see that there is an attempt to reuse the same producer across tasks on the same workers. When sending a lot of records at…
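A hedged sketch of the two usual levers: repartitioning to raise the number of concurrently writing tasks (a producer instance is reused per executor JVM), and passing producer tuning through the kafka.-prefixed options. The source path, topic, and values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-bulk-produce").getOrCreate()

val df = spark.read.parquet("/data/records")   // hypothetical source

// Write parallelism is governed by the number of concurrent tasks, so
// repartition before writing; kafka.* options go straight to the producer.
df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
  .repartition(64)
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "out-topic")
  .option("kafka.batch.size", "131072")      // bigger producer batches
  .option("kafka.linger.ms", "20")           // wait briefly to fill them
  .option("kafka.compression.type", "lz4")   // fewer bytes on the wire
  .save()
```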
1
vote
1 answer

Can I use Airflow to start/stop a Spark streaming job?

I have two types of jobs: Spark batch jobs and Spark streaming jobs. I would like to schedule and manage them both with Airflow. Airflow is designed for jobs that stop, but I want to use it for my streaming job. Can anyone give me some idea or other…
1
vote
1 answer

java.io.IOException: Failed to write statements to batch_layer.test. The latest exception was Key may not be empty

I am trying to count the number of words in a text and save the result to the Cassandra database. The producer reads the data from a file and sends it to Kafka. The consumer uses Spark Streaming to read and process the data, and then sends the result of the…
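"Key may not be empty" is typically Cassandra rejecting a row whose partition key is an empty string, which a plain split on whitespace can produce. A sketch using the keyspace and table from the error message, assumed column names, and the spark-cassandra-connector:

```scala
import com.datastax.spark.connector._             // SomeColumns
import com.datastax.spark.connector.streaming._   // saveToCassandra on DStreams
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wordcount-to-cassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val ssc = new StreamingContext(conf, Seconds(10))

// Stand-in source; in the real job this is the Kafka direct stream.
val lines = ssc.socketTextStream("localhost", 9999)

lines.flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)        // empty tokens become empty partition keys,
  .map(w => (w, 1L))         // which Cassandra rejects: "Key may not be empty"
  .reduceByKey(_ + _)
  .saveToCassandra("batch_layer", "test", SomeColumns("word", "count"))

ssc.start()
ssc.awaitTermination()
```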
1
vote
0 answers

Spark streaming Dynamic Schema Evolution from Kafka Eventhub on Microbatch

We are streaming data from the Kafka Eventhub. The records may have a nested structure. The schema is inferred dynamically from the data, and the Delta table is created with the schema of the first incoming batch of data. Note: The data read from…
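A common pattern here is to do the write inside foreachBatch and let Delta merge new columns per micro-batch. A sketch; parseBatch is a hypothetical stand-in for the asker's own parsing and flattening logic, and the endpoint, topic, and path are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("delta-schema-evolution").getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "eventhub-host:9093")  // placeholder
  .option("subscribe", "events")                            // placeholder
  .load()

// Hypothetical stand-in for the asker's JSON parsing / flattening logic.
def parseBatch(raw: DataFrame): DataFrame =
  raw.selectExpr("CAST(value AS STRING) AS json")

stream.writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    parseBatch(batch).write
      .format("delta")
      .option("mergeSchema", "true")   // let new columns evolve the table schema
      .mode("append")
      .save("/delta/events")           // placeholder table path
  }
  .start()
  .awaitTermination()
```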
1
vote
1 answer

Spark Streaming: Read from HBase by received stream keys?

What is the best way to compare data received in Spark Streaming to existing data in HBase? We receive data from Kafka as a DStream, and before writing it down to HBase we must scan HBase for data based on the keys received from Kafka, do some calculation…
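For key-based lookups, a full HBase scan is unnecessary; point gets inside mapPartitions, with one connection per partition, are the usual pattern. A sketch with placeholder table, column family, and qualifier names:

```scala
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

// Point lookups by received key, one HBase connection per partition.
def lookup(keys: Iterator[String]): Iterator[(String, Option[String])] = {
  val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("existing_data"))
  val out = keys.map { k =>
    val result = table.get(new Get(Bytes.toBytes(k)))
    val value  = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
      .map(Bytes.toString)
    (k, value)
  }.toList               // materialize before closing the connection
  table.close()
  conn.close()
  out.iterator
}

// In the streaming job: dstream.transform(_.mapPartitions(lookup)), or batch
// the keys into a single table.get(gets: java.util.List[Get]) call per partition.
```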
1
vote
1 answer

How does Spark calculate the window start time for a given window interval?

Consider I have an input df with a timestamp column. When setting the window duration (with no sliding interval) to 10 minutes, with an input time of (2019-02-28 22:33:02), the window formed is (2019-02-28 22:30:02) to (2019-02-28 22:40:02). 8…
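Tumbling windows are aligned to the Unix epoch plus the optional startTime offset, not to the first event: conceptually, windowStart = timestamp - ((timestamp - startTime) % windowDuration). A small sketch to verify:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("window-start").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("2019-02-28 22:33:02").toDF("ts")
  .withColumn("ts", to_timestamp($"ts"))

df.groupBy(window($"ts", "10 minutes")).count().show(false)
// -> window.start 22:30:00, window.end 22:40:00 with the default startTime.
// window($"ts", "10 minutes", "10 minutes", "2 seconds") would shift the
// boundaries to 22:30:02 .. 22:40:02, matching the example in the question.
```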