Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.
Questions tagged [spark-streaming]
5565 questions
2
votes
1 answer
Applying DataFrame operations to a Single row in mapWithState
I'm on Spark 2.1.0 with Scala 2.11. I have a requirement to store state in Map[String, Any] format for every key. The right candidate to solve my problem appears to be mapWithState(), which is defined in PairDStreamFunctions. The DStream on which I…
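A minimal sketch of the mapWithState() pattern the question describes, assuming a DStream[(String, Map[String, Any])] named pairs in which each batch carries an update map per key:

import org.apache.spark.streaming.{State, StateSpec}

// Merge the incoming update (if any) into the map held as state for this key.
val trackState = (key: String, update: Option[Map[String, Any]],
                  state: State[Map[String, Any]]) => {
  val merged = state.getOption.getOrElse(Map.empty[String, Any]) ++
               update.getOrElse(Map.empty[String, Any])
  state.update(merged)   // persist the merged map as the new state
  (key, merged)          // emit the key with its updated state
}

// pairs.mapWithState(StateSpec.function(trackState))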

Kryptic Coder
- 612
- 2
- 8
- 20
2
votes
1 answer
Receiver-less approach for spark-streaming with Kinesis
For Spark Streaming with Kafka we have DirectStream, a receiver-less approach that maps the Kafka partitions to Spark RDD partitions. Currently we have an application in which we use the Kafka direct approach and maintain our own offsets in an RDBMS.
Do…
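For reference, a sketch of the direct-stream-with-external-offsets pattern the question describes (spark-streaming-kafka 0.8 API); ssc and kafkaParams are assumed to exist, and loadOffsetsFromRdbms/saveOffsetsToRdbms are hypothetical helpers backed by the RDBMS:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Resume exactly where the last run left off.
val fromOffsets: Map[TopicAndPartition, Long] = loadOffsetsFromRdbms()

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ...process the batch...
  saveOffsetsToRdbms(ranges)   // hypothetical: commit offsets with the results
}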

kalyan chakravarthy
- 643
- 10
- 29
2
votes
0 answers
Spark Job Processing Time increases to 4s without explanation
We are running a cluster with 1 namenode and 3 datanodes on Azure, and on top of this I am running my Spark job in YARN cluster mode.
We are also using HDP 2.5, which has Spark 1.6.2 integrated into its setup. Now I have this very weird issue where…

Biplob Biswas
- 1,761
- 19
- 33
2
votes
0 answers
How to perform multithreading or parallel processing in Spark, implemented in Scala
Hi, I have a Spark Streaming program which reads events from Event Hub and pushes them to topics. Processing each batch takes almost 10 times the batch interval.
When I try to implement multithreading I am not able to see much…
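Two levers commonly tried in this situation, sketched under the assumption that the jobs within a batch are independent; note that spark.streaming.concurrentJobs is an undocumented setting, so treat it with care:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Let output operations of successive batches run in parallel.
val conf = new SparkConf()
  .setAppName("eventhub-stream")                    // assumed app name
  .set("spark.streaming.concurrentJobs", "2")       // undocumented knob

val ssc = new StreamingContext(conf, Seconds(10))   // assumed batch interval

// Alternatively, raise task-level parallelism inside each batch:
// events.repartition(ssc.sparkContext.defaultParallelism * 2)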

ankush reddy
- 481
- 1
- 5
- 28
2
votes
2 answers
How to write compressed data to Kafka in Spark Streaming?
Is it possible to write gzip-compressed data to Kafka from Spark Streaming? Are there any examples/samples that show how to write and read compressed data from Kafka in a Spark Streaming job?
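Compression in Kafka is a producer-side setting, so one approach is to set compression.type when writing from foreachRDD; consumers decompress transparently, so the reading job needs no changes. A sketch assuming a DStream[String] named stream and a hypothetical broker address:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")   // assumed address
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("compression.type", "gzip")           // gzip-compress record batches
    val producer = new KafkaProducer[String, String](props)
    records.foreach(r => producer.send(new ProducerRecord[String, String]("out-topic", r)))
    producer.close()
  }
}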

vijay
- 1,203
- 1
- 13
- 25
2
votes
1 answer
PySpark - top-n words from multiple files
I have a Python dictionary:
diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'}
I have created an RDD like this:
docNameToText = sc.parallelize(diction)
I need to find the top-2 strings…

stfd1123581321
- 163
- 1
- 2
- 6
2
votes
1 answer
Find the count of messages in each RDD of a JavaDStream
Hi, I am trying to integrate Kafka with Spark Streaming.
I want to find the count of messages in each RDD of a JavaDStream.
Please see the code below and give me some suggestions.
public class App {
@SuppressWarnings("serial")
public static void…
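A per-batch count is typically taken with foreachRDD; a sketch in Scala for brevity (the Java API mirrors it), assuming a DStream named stream:

var totalCount = 0L                    // driver-side running total
stream.foreachRDD { rdd =>
  val batchCount = rdd.count()         // triggers one job per micro-batch
  totalCount += batchCount
  println(s"messages in batch: $batchCount, total so far: $totalCount")
}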

Jagadeesh
- 421
- 5
- 16
2
votes
1 answer
Spark Streaming - obtain batch-level performance stats
I'm setting up an Apache Spark cluster to perform real-time streaming computations and would like to monitor the performance of the deployment by tracking various metrics such as batch sizes, batch processing times, etc. My Spark Streaming program…
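Spark Streaming exposes exactly these per-batch metrics through the StreamingListener interface; a minimal sketch, registered on the StreamingContext:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"records=${info.numRecords} " +
            s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)}ms " +
            s"processingTime=${info.processingDelay.getOrElse(-1L)}ms")
  }
}

// ssc.addStreamingListener(new BatchStatsListener)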

jithinpt
- 1,204
- 2
- 16
- 33
2
votes
1 answer
Spark RDD vs DataSet performance
I am new to Spark. I am experimenting with Spark 2.1 for CEP purposes.
To detect a missing event in the last 2 minutes, I am converting the received input into input events in a JavaDStream and then performing reduceByKeyAndWindow on inputEvents and…
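A sketch of the windowed-count approach the question hints at, assuming a DStream[(String, Long)] of (key, 1L) pairs named events; the inverse-reduce variant keeps keys whose count has dropped to zero, which is what flags a missing event (checkpointing must be enabled):

import org.apache.spark.streaming.{Minutes, Seconds}

val windowedCounts = events.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // counts entering the window
  (a: Long, b: Long) => a - b,   // counts leaving the window
  Minutes(2),                    // window length
  Seconds(10))                   // slide interval (assumed)

// windowedCounts.filter(_._2 == 0)   // keys with no events in the last 2 minutes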

Abirami
- 87
- 7
2
votes
0 answers
Kafka Spark streaming HBase insert issues
I'm using Kafka to send a file with 3 columns, using Spark Streaming 1.3 to insert it into HBase.
This is what my HBase table looks like:
ROW COLUMN+CELL
zone:bizert column=travail:call, timestamp=1491836364921,…
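A common write pattern, sketched with the HBase 1.0+ client API, assuming a DStream of 3-field string tuples named stream, a hypothetical table name, and the "travail" column family from the excerpt above (the second qualifier is assumed); the connection is opened per partition because it is not serializable:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

stream.foreachRDD { rdd =>
  rdd.foreachPartition { rows =>
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("calls"))   // assumed table name
    rows.foreach { case (rowKey, call, duration) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("travail"), Bytes.toBytes("call"),
                    Bytes.toBytes(call))
      put.addColumn(Bytes.toBytes("travail"), Bytes.toBytes("duration"),   // assumed qualifier
                    Bytes.toBytes(duration))
      table.put(put)
    }
    table.close()
    conn.close()
  }
}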

Zied Hermi
- 229
- 1
- 2
- 11
2
votes
3 answers
In a Spark Streaming job, how to collect error messages from the executors on the driver and log them at the end of each streaming batch?
I want to log all the error messages on the driver machine. How can I do this efficiently?
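One way to do this, sketched with a Spark 2.x CollectionAccumulator; process and log are hypothetical stand-ins for the job's own logic and logger, and ssc/stream are assumed to exist:

import scala.collection.JavaConverters._
import org.apache.spark.util.CollectionAccumulator

val errors: CollectionAccumulator[String] =
  ssc.sparkContext.collectionAccumulator[String]("batch-errors")

stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    try process(record)   // hypothetical per-record logic, runs on executors
    catch { case e: Exception => errors.add(e.getMessage) }
  }
  // this part runs on the driver once the batch's jobs have finished
  errors.value.asScala.foreach(msg => log.error(msg))   // log is hypothetical
  errors.reset()
}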

vindev
- 2,240
- 2
- 13
- 20
2
votes
1 answer
Kafka or Redis for real-time BI
I am working on a project for real-time business intelligence, and I am using the Elastic Stack, Spark Streaming, and Kafka. But I am wondering if I could use Redis instead of Kafka, because it appears that Redis is an in-memory beast that can forward…

Drissi Yazami
- 37
- 10
2
votes
0 answers
Spark streaming kafka offset acknowledgement - are gaps possible?
Let's say I have window2 -> window1 (window1 goes before window2).
Let's say the offsets are (start2, end2) and (start1, end1) respectively.
Since processing each window might take a different amount of time, window2 might finish processing before window1. Then:…
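For context, the commit pattern the question is reasoning about (spark-streaming-kafka-0-10 API), sketched with an assumed direct stream named stream; commitAsync merely queues the ranges you hand it and commits them on a later poll, tying each commit to the batch that produced it:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // the cast only works on the direct stream's RDDs, before any transformation
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ...windowing / processing happens downstream...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
}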

Konstantin Kulagin
- 684
- 6
- 16
2
votes
0 answers
Spark Streaming - restarting from Checkpoint
We are building a fault-tolerant system that can read from Kafka and write to HBase & HDFS. The batch runs every 5 seconds. Here's the scenario we were hoping to set up:
Start a new Spark Streaming process with checkpointing enabled, read from Kafka,…
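The standard recovery pattern for this scenario is StreamingContext.getOrCreate; a sketch assuming an existing sparkConf and an HDFS checkpoint path, with all DStream wiring inside the factory so it can be rebuilt from the checkpoint on restart:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-stream"   // assumed path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // ...create the Kafka stream and wire the HBase/HDFS output here...
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()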

Shay
- 505
- 1
- 3
- 19
2
votes
1 answer
Apache Spark "object not serializable" exception for JSON parser
I am reading the data [JSON as String] from a Kafka queue and trying to parse the JSON String into a case class using the Lift-web JSON API.
Here is the code snippet:
val sparkStreamingContext = new StreamingContext(sparkConf, Seconds(5))
val kafkaParam:…
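A frequent cause of this exception is capturing the lift-json Formats (or the parser) from the driver; a sketch of the usual fix, building it inside mapPartitions so it lives on the executors, with a hypothetical Event case class and a DStream[String] named messages:

import net.liftweb.json.{parse, DefaultFormats}

case class Event(id: String, value: Double)   // hypothetical schema

val events = messages.mapPartitions { jsonStrings =>
  implicit val formats = DefaultFormats       // constructed executor-side, never shipped
  jsonStrings.map(s => parse(s).extract[Event])
}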

Akash Sethi
- 2,284
- 1
- 20
- 40