Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.

5565 questions
2 votes, 1 answer

Applying DataFrame operations to a Single row in mapWithState

I'm on Spark 2.1.0 with Scala 2.11. I have a requirement to store state in Map[String, Any] format for every key. The right candidate to solve my problem appears to be mapWithState(), which is defined in PairDStreamFunctions. The DStream on which I…
Kryptic Coder • 612 • 2 • 8 • 20
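A minimal sketch of the mapWithState pattern the question is after, assuming a (String, Int) event stream with a socket source as a stand-in; the state fields ("count", "last") are illustrative only:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  // Keep a Map[String, Any] per key, as the question requires.
  def trackState(key: String, value: Option[Int],
                 state: State[Map[String, Any]]): (String, Map[String, Any]) = {
    val current = state.getOption().getOrElse(Map.empty[String, Any])
    val count   = current.getOrElse("count", 0).asInstanceOf[Int] + 1
    val updated = current + ("count" -> count, "last" -> value.getOrElse(0))
    state.update(updated)                     // persist the new state for this key
    (key, updated)                            // element emitted downstream
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("state-sketch"), Seconds(5))
    ssc.checkpoint("/tmp/state-checkpoint")   // mapWithState requires checkpointing
    val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))
    pairs.mapWithState(StateSpec.function(trackState _)).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```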
2 votes, 1 answer

Receiver-less approach for Spark Streaming with Kinesis

For Spark Streaming with Kafka we have the direct stream, which is a receiver-less approach and maps the Kafka partitions to Spark RDD partitions. Currently we have an application in which we use the Kafka direct approach and maintain our own offsets in an RDBMS. Do…
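For reference, the DStream Kinesis connector is receiver-based (KinesisUtils.createStream); there is no direct-stream equivalent in the classic API. The Kafka direct pattern with externally stored offsets that the question describes looks roughly like the sketch below, where loadOffsetsFromRdbms, the broker address, and the topic name are placeholders:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val ssc = new StreamingContext(new SparkConf().setAppName("direct-sketch"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "offset-demo",
  "enable.auto.commit" -> (false: java.lang.Boolean))   // offsets are managed externally

// Hypothetical helper: load the last committed offsets from the RDBMS.
def loadOffsetsFromRdbms(): Map[TopicPartition, Long] =
  Map(new TopicPartition("events", 0) -> 0L)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams, loadOffsetsFromRdbms()))
```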
2 votes, 0 answers

Spark Job Processing Time increases to 4s without explanation

We are running a cluster of 1 namenode and 3 datanodes on top of Azure. On top of this I am running my Spark job in yarn-cluster mode. Also, we are using HDP 2.5, which has Spark 1.6.2 integrated into its setup. Now I have this very weird issue where…
Biplob Biswas • 1,761 • 19 • 33
2 votes, 0 answers

How to perform multithreading or parallel processing in Spark Streaming, implemented in Scala

Hi, I have a Spark Streaming program which reads events from Event Hub and pushes them to topics. Processing each batch takes almost 10 times the batch interval. When I try to implement multithreading I am not able to see much…
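Two commonly suggested levers for this symptom, sketched below: repartitioning so each batch spreads over more cores, and the spark.streaming.concurrentJobs setting (an undocumented knob that lets jobs from several batches run at once, at the cost of ordering guarantees). The socket source and the core count are stand-ins:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("parallelism-sketch")
  .set("spark.streaming.concurrentJobs", "2")  // undocumented: run 2 batch jobs at once

val ssc = new StreamingContext(conf, Seconds(10))

ssc.socketTextStream("localhost", 9999)        // stand-in for the Event Hub source
  .repartition(16)                             // assumption: ~16 cores to spread work over
  .foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      records.foreach(_ => ())                 // per-partition work runs as parallel tasks
    }
  }

ssc.start()
ssc.awaitTermination()
```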
2 votes, 2 answers

How to write compressed data to Kafka in Spark Streaming?

Is it possible to write gzip-compressed data to Kafka from Spark Streaming? Are there any examples/samples that show how to write and read compressed data from Kafka in a Spark Streaming job?
vijay • 1,203 • 1 • 13 • 25
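One answer-shaped sketch: compression is normally handled by the Kafka producer itself via compression.type=gzip, and consumers (including a Spark Streaming job) decompress transparently, so no special read path is needed. The broker address, topic, and socket source are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(new SparkConf().setAppName("gzip-sketch"), Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)   // stand-in for the real source

lines.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("compression.type", "gzip")             // the relevant setting
    val producer = new KafkaProducer[String, String](props)
    records.foreach(r => producer.send(new ProducerRecord[String, String]("out-topic", r)))
    producer.close()                                  // flushes pending sends
  }
}

ssc.start()
ssc.awaitTermination()
```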
2 votes, 1 answer

PySpark - top-n words from multiple files

I have a python dictionary: diction = {'1.csv': 'this is is a test test test ', '2.txt': 'that that was a test test test'} I have created an RDD like this: docNameToText = sc.parallelize(diction) I need to find the top-2 strings…
stfd1123581321 • 163 • 1 • 2 • 6
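One subtlety worth noting: in PySpark, sc.parallelize(diction) on a dict distributes only the keys, so diction.items() would be needed. The computation itself, sketched here in Scala (the API most of this page uses), is a word count per document followed by a per-document top-2:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("top-n-sketch").setMaster("local[*]"))

val diction = Map(
  "1.csv" -> "this is is a test test test",
  "2.txt" -> "that that was a test test test")

val top2 = sc.parallelize(diction.toSeq)               // (docName, text) pairs
  .flatMap { case (doc, text) =>
    text.split("\\s+").filter(_.nonEmpty).map(word => ((doc, word), 1)) }
  .reduceByKey(_ + _)                                  // count per (doc, word)
  .map { case ((doc, word), n) => (doc, (word, n)) }
  .groupByKey()
  .mapValues(_.toSeq.sortBy(-_._2).take(2))            // top-2 words per document

top2.collect().foreach(println)
// e.g. (1.csv, List((test,3), (is,2))) and (2.txt, List((test,3), (that,2)))
```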
2 votes, 1 answer

Find the count of messages per RDD (foreachRDD) in a JavaDStream

Hi, I am trying to integrate Kafka with Spark Streaming. I want to find the count of messages in each RDD of a JavaDStream. Please see the code below and give me some suggestions. public class App { @SuppressWarnings("serial") public static void…
Jagadeesh • 421 • 5 • 16
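The usual shape of the answer, sketched in Scala for brevity: rdd.count() is an action, so calling it inside foreachRDD returns the per-batch message count to the driver. The socket source is a stand-in for the Kafka stream:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc    = new StreamingContext(new SparkConf().setAppName("count-sketch"), Seconds(5))
val stream = ssc.socketTextStream("localhost", 9999)   // stand-in for the Kafka DStream

stream.foreachRDD { rdd =>
  val count = rdd.count()                              // action: runs on the cluster,
  println(s"messages in this batch: $count")           // result comes back to the driver
}

ssc.start()
ssc.awaitTermination()
```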
2 votes, 1 answer

Spark Streaming - obtain batch-level performance stats

I'm setting up an Apache Spark cluster to perform real-time streaming computations and would like to monitor the performance of the deployment by tracking various metrics like batch sizes, batch processing times, etc. My Spark Streaming program…
jithinpt • 1,204 • 2 • 16 • 33
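Spark Streaming exposes exactly these numbers through the StreamingListener interface; a minimal sketch that logs per-batch record counts and delays:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs batch size and timing each time a batch finishes.
class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"records=${info.numRecords} " +
            s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)}ms " +
            s"processingTime=${info.processingDelay.getOrElse(-1L)}ms " +
            s"totalDelay=${info.totalDelay.getOrElse(-1L)}ms")
  }
}

// Register it on an existing StreamingContext:
// ssc.addStreamingListener(new BatchStatsListener())
```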
2 votes, 1 answer

Spark RDD vs DataSet performance

I am new to Spark. I am experimenting with Spark 2.1 for CEP purposes, to detect a missing event within the last 2 minutes. I am converting the received input to a JavaDStream of input events and then performing reduceByKeyAndWindow on inputEvents and…
Abirami • 87 • 7
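For reference, the windowed aggregation the question mentions looks like this in the Scala DStream API; the inverse-function variant keeps the 2-minute window incremental, and a key whose count drops to zero has produced no event in the window. The event layout and socket source are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("cep-sketch"), Seconds(10))
ssc.checkpoint("/tmp/cep-checkpoint")        // required by the inverse-function variant

// Assumed input: one event id per line; count events per id over 2 minutes.
val counts = ssc.socketTextStream("localhost", 9999)
  .map(id => (id, 1L))
  .reduceByKeyAndWindow(_ + _, _ - _, Minutes(2), Seconds(10))

// Ids whose windowed count has fallen to zero: the "missing event" signal.
counts.filter(_._2 == 0L).print()

ssc.start()
ssc.awaitTermination()
```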
2 votes, 0 answers

Kafka Spark streaming HBase insert issues

I'm using Kafka to send a file with 3 columns, using Spark Streaming 1.3 to insert it into HBase. This is what my HBase table looks like: ROW COLUMN+CELL zone:bizert column=travail:call, timestamp=1491836364921,…
Zied Hermi • 229 • 1 • 2 • 11
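The usual shape of the write path, sketched under some assumptions (HBase 1.x client API, a hypothetical table name, a semicolon-delimited 3-column message): open the connection once per partition on the executors, never per record and never on the driver:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// lines: DStream[String] from Kafka, e.g. "zone:bizert;0;23" (assumed layout)
lines.foreachRDD { rdd =>
  rdd.foreachPartition { messages =>
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("calls"))      // hypothetical table name
    messages.foreach { msg =>
      val Array(rowKey, call, duration) = msg.split(";")       // assumed delimiter
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("travail"), Bytes.toBytes("call"), Bytes.toBytes(call))
      put.addColumn(Bytes.toBytes("travail"), Bytes.toBytes("duration"), Bytes.toBytes(duration))
      table.put(put)
    }
    table.close()
    conn.close()
  }
}
```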
2 votes, 3 answers

In a Spark Streaming job, how to collect error messages from executors on the driver and log them at the end of each streaming batch?

I want to log all the error messages on the driver machine. How can I do this efficiently?
vindev • 2,240 • 2 • 13 • 20
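One common pattern, sketched below with a CollectionAccumulator (Spark 2.x): executors append error strings, and because foreach is an action, the driver can drain and log the accumulator once the batch has finished. process() is a hypothetical stand-in for the job's own per-record logic, and ssc/stream are assumed to exist:

```scala
import scala.collection.JavaConverters._

// Created once on the driver, against an existing StreamingContext `ssc`
// and input DStream `stream` (both built elsewhere):
val errors = ssc.sparkContext.collectionAccumulator[String]("batch-errors")

stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    try {
      process(record)                       // hypothetical per-record logic
    } catch {
      case e: Exception => errors.add(s"$record -> ${e.getMessage}")
    }
  }
  // foreach is an action, so the batch has completed by this line.
  errors.value.asScala.foreach(msg => println(s"[BATCH ERROR] $msg"))
  errors.reset()                            // start clean for the next batch
}
```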
2 votes, 1 answer

Kafka or Redis for real-time BI

I am working on a project for real-time business intelligence and I am using the Elastic Stack, Spark Streaming, and Kafka, but I am wondering if I could use Redis instead of Kafka, because it appears that Redis is an in-memory beast that can forward…
2 votes, 0 answers

Spark streaming kafka offset acknowledgement - are gaps possible?

Let's say I have window2 -> window1 (window1 goes before window2). Let's say the offsets are (start2, end2) and (start1, end1) respectively. Since each window's processing might take a different amount of time, window2 might finish processing before window1. Then:…
2 votes, 0 answers

Spark Streaming - restarting from Checkpoint

We are building a fault-tolerant system that can read from Kafka and write to HBase & HDFS. The batch runs every 5 seconds. Here's the scenario we were hoping to set up: start a new Spark Streaming process with checkpointing enabled, read from Kafka,…
Shay • 505 • 1 • 3 • 19
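The restart pattern that scenario depends on, sketched with a hypothetical checkpoint path: all DStream setup must live inside the factory function, so that StreamingContext.getOrCreate can rebuild the graph from the checkpoint after a crash:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/app"        // hypothetical path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("restartable"), Seconds(5))
  ssc.checkpoint(checkpointDir)
  // Build the Kafka input and the HBase/HDFS outputs here, inside the
  // factory: it runs on a clean start; on restart the graph is restored.
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```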
2 votes, 1 answer

Apache Spark Object not Serializable Exception for json parser

I am reading the data [JSON as String] from a Kafka queue and trying to parse the JSON String into a case class using the liftweb json API. Here is the code snippet: val sparkStreamingContext = new StreamingContext(sparkConf, Seconds(5)) val kafkaParam:…
Akash Sethi • 2,284 • 1 • 20 • 40
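The usual fix for this exception, sketched with an assumed case class: keep the lift-json formats and the parse call inside a top-level object, so the closure shipped to executors captures no outer class (such as the one holding the StreamingContext):

```scala
import net.liftweb.json.{parse, DefaultFormats, Formats}

case class Event(id: String, value: Int)     // assumed shape of the JSON

object JsonParsing {
  // Lives in a static object: nothing here is serialized with closures.
  implicit val formats: Formats = DefaultFormats
  def toEvent(json: String): Event = parse(json).extract[Event]
}

// messages: DStream[String] from Kafka, built as in the question
// val events = messages.map(JsonParsing.toEvent _)
```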