Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.
Questions tagged [spark-streaming]
5565 questions
2 votes · 0 answers
Spark Streaming with large messages java.lang.OutOfMemoryError: Java heap space
I am using Spark Streaming 1.6.1 with Kafka 0.9.0.1 (createStream API) on HDP 2.4.2. My use case sends large messages, ranging from 5 MB to 30 MB, to Kafka topics; in such cases Spark Streaming fails to complete its job and crashes with the exception below. I am…

nilesh1212 · 1,561 · 2 · 26 · 60
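
A plausible starting point for the question above, assuming an existing StreamingContext ssc: the receiver-based createStream API hands its properties to Kafka's old high-level consumer, whose fetch.message.max.bytes must exceed the largest expected message, and a serialized, disk-spillable storage level eases heap pressure. The broker, group id, and topic names below are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Old (0.8-style) consumer properties used by the receiver-based API.
    val kafkaParams = Map[String, String](
      "zookeeper.connect"       -> "zk-host:2181",               // placeholder
      "group.id"                -> "large-msg-group",            // placeholder
      "fetch.message.max.bytes" -> (64 * 1024 * 1024).toString)  // > 30 MB messages

    // Serialized storage that can spill to disk instead of exhausting the heap.
    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("bigTopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)

Raising spark.executor.memory is the complementary knob if the batches themselves no longer fit.
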
2 votes · 2 answers
DStream checkpointing has been enabled but the DStreams with their functions are not serializable
I want to send a DStream to Kafka, but it still doesn't work.
searchWordCountsDStream.foreachRDD(rdd =>
  rdd.foreachPartition(partitionOfRecords => {
    val props = new HashMap[String, Object]()
    …

Kof · 65 · 2 · 5
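
The usual culprit behind this error is a non-serializable object (such as a KafkaProducer) created on the driver and captured by the checkpointed closure. A minimal sketch of the common fix, constructing the producer inside foreachPartition so it never needs to be serialized; the broker address and output topic are placeholders:

    import java.util.HashMap
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    searchWordCountsDStream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // Build the producer on the executor, once per partition.
        val props = new HashMap[String, Object]()
        props.put("bootstrap.servers", "broker:9092")  // placeholder
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partitionOfRecords.foreach { record =>
          producer.send(new ProducerRecord[String, String]("out-topic", record.toString))
        }
        producer.close()
      }
    }
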
2 votes · 2 answers
Zeppelin Twitter Streaming Example Not Working
I am trying to run the Twitter streaming example in Zeppelin. After searching around, I added "org.apache.bahir:spark-streaming-twitter_2.11:2.0.0" to the Spark interpreter, which makes the first part work, as in:
Apache Zeppelin 0.6.1: Run Spark 2.0…

user1828513 · 367 · 2 · 7 · 16
2 votes · 1 answer
RDD toDF(): Erroneous Behavior
I built a Spark Streaming app that fetches content from a Kafka queue and intends to put the data into a MySQL table after some pre-processing and structuring.
I call the 'foreachRDD' method on the SparkStreamingContext. The issue that I'm facing is…

arshellium · 215 · 1 · 6 · 17
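
A frequent cause of odd toDF() behavior in Spark 1.x is building a fresh SQLContext (or importing its implicits) in the wrong scope. A minimal sketch of the usual pattern, with a singleton SQLContext per batch; the Event case class, the pair-shaped input, and the MySQL coordinates are hypothetical:

    import org.apache.spark.sql.SQLContext

    case class Event(id: String, value: Double)  // hypothetical schema

    dstream.foreachRDD { rdd =>
      // Reuse one SQLContext per JVM rather than creating one per batch.
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      val df = rdd.map { case (k, v) => Event(k, v.toDouble) }.toDF()
      df.write.mode("append").jdbc(
        "jdbc:mysql://db-host:3306/mydb", "events", new java.util.Properties())  // placeholders
    }
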
2 votes · 0 answers
Share a variable across Spark streams
How can I share variables across Spark streams in PySpark?
I'm trying to share a dataframe that holds various values for a combination of features, for example platform etc.
The program works once, when the global variable is first initialized. It…

dvshekar · 93 · 11
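
The question is PySpark, but the commonly suggested pattern translates directly: hold the shared lookup in a broadcast variable and re-broadcast from the driver inside foreachRDD when it goes stale, instead of mutating a global once. Sketched in Scala for consistency with the other snippets here; loadFeatureTable and needsRefresh are hypothetical helpers:

    import org.apache.spark.broadcast.Broadcast

    def loadFeatureTable(): Map[String, Double] = Map("android" -> 1.0)  // hypothetical loader
    def needsRefresh(): Boolean = false                                  // hypothetical staleness check

    var featureTable: Broadcast[Map[String, Double]] = sc.broadcast(loadFeatureTable())

    stream.foreachRDD { rdd =>
      if (needsRefresh()) {            // runs on the driver, once per batch
        featureTable.unpersist()
        featureTable = rdd.sparkContext.broadcast(loadFeatureTable())
      }
      val table = featureTable         // stable reference for the executor closure
      rdd.foreach(x => println(table.value.getOrElse(x.toString, 0.0)))
    }
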
2 votes · 1 answer
Flow time stamp through streaming functions
How is it possible (if at all) to generate a random number or obtain the system time each time a batch is run with Spark Streaming?
I have two functions which process a batch of messages:
1 - The first processes the key, creates a file (CSV) and writes headers
2 -…

Ken Alton · 686 · 1 · 9 · 21
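
For the question above, foreachRDD has an overload that receives the batch time, which is stable for the whole batch (including retries), unlike calling System.currentTimeMillis inside executors. A minimal sketch, with messages standing in for the asker's DStream:

    import org.apache.spark.streaming.Time

    messages.foreachRDD { (rdd, batchTime: Time) =>
      // batchTime is the batch's scheduled time, identical across recomputations.
      val stamp = batchTime.milliseconds
      val fileName = s"output-$stamp.csv"  // e.g. name the per-batch CSV with it
      println(s"processing batch $stamp into $fileName")
    }
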
2 votes · 1 answer
Spark UI's kill is not killing Driver
I am trying to kill my Spark-Kafka streaming job from the Spark UI. It is able to kill the application, but the driver is still running.
Can anyone help me with this? My other streaming jobs are fine; only one of the streaming jobs is giving this…

AKC · 953 · 4 · 17 · 46
2 votes · 1 answer
How to join two (or more) streams (JavaDStream) in Apache Spark
We have a Spark Streaming application that consumes the Gnip compliance stream.
In the old version of the API, the compliance stream was provided by one endpoint, but now it is provided by 8 different endpoints.
We could run the same Spark application…

Fanooos · 2,718 · 5 · 31 · 55
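
When all endpoints yield the same record type, the usual answer is not a join but a union of the streams into one DStream. Shown in Scala for consistency (JavaStreamingContext.union is the JavaDStream analogue); the socket streams below are stand-ins for the real per-endpoint Gnip connectors:

    // One stream per compliance endpoint (stand-ins for the 8 real connectors).
    val endpointStreams = (1 to 8).map(i => ssc.socketTextStream("gnip-proxy", 9000 + i))

    // Merge them and process exactly as the single-endpoint version did.
    val merged = ssc.union(endpointStreams)
    merged.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
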
2 votes · 1 answer
Spark Memory/worker issues & what is the correct spark configuration?
I have a total of 6 nodes in my Spark cluster. 5 nodes have 4 cores and 32 GB RAM each, and one node (node 4) has 8 cores and 32 GB RAM.
So I have a total of 6 nodes: 28 cores, 192 GB RAM. (I want to use half of the memory, but all cores.)
Planning to…

AKC · 953 · 4 · 17 · 46
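
One way to reason about the sizing question above: targeting all 28 cores but only half the memory (~96 GB), a uniform layout of 4 cores per executor gives 7 executors of roughly 13 GB each, leaving headroom for the OS and overhead. An illustrative sketch only, not a definitive tuning:

    import org.apache.spark.SparkConf

    // 28 cores / 4 cores per executor = 7 executors;
    // 96 GB / 7 executors ≈ 13 GB per executor.
    val conf = new SparkConf()
      .setAppName("sized-streaming-app")
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "13g")
      .set("spark.cores.max", "28")   // standalone-mode cap on total cores
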
2 votes · 1 answer
Reading files dynamically from HDFS from within spark transformation functions
How can a file be read from HDFS in a Spark function without using the SparkContext inside the function?
Example:
val filedata_rdd = rdd.map { x => ReadFromHDFS(x.getFilePath) }
The question is how ReadFromHDFS can be implemented. Usually, to read from HDFS we…

darkknight444 · 546 · 8 · 21
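
A common answer to the question above: a transformation cannot use the SparkContext, but executors can open HDFS directly through the Hadoop FileSystem API. A minimal sketch implementing the asker's ReadFromHDFS (reading each file fully as UTF-8 text):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    def readFromHDFS(path: String): String = {
      // FileSystem.get returns a cached, shareable instance per URI scheme.
      val fs = FileSystem.get(URI.create(path), new Configuration())
      val in = fs.open(new Path(path))
      try scala.io.Source.fromInputStream(in, "UTF-8").mkString
      finally in.close()
    }

    // Usage, mirroring the question's code:
    // val filedata_rdd = rdd.map(x => readFromHDFS(x.getFilePath))
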
2 votes · 0 answers
Reading from Kafka 0.8.1 and writing to Kafka 0.9.0
I have a requirement where I should read messages from Kafka v0.8.1 (in cluster A) and write to Kafka v0.9.0 (in cluster B).
I am using Spark Streaming to read from Kafka A and push messages into Kafka B using Spark's native Kafka classes.
It is giving…

jintocvg · 158 · 3 · 14
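
A point often raised with this setup: keeping two Kafka client versions on one Spark classpath is fragile, and since older clients can generally talk to newer brokers (while the reverse does not hold), reading and writing through a single 0.8-era client library is a commonly suggested route. A rough sketch of the consuming half; the broker list and topic are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Consume from cluster A with the direct (receiver-less) 0.8 API.
    val streamA = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "clusterA-broker:9092"),  // placeholder
      Set("sourceTopic"))                                     // placeholder

The writing half can reuse the per-partition producer pattern sketched under the checkpointing question above, with bootstrap.servers pointed at cluster B.
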
2 votes · 0 answers
Use spark-streaming as a scheduler
I have a Spark job that reads from an Oracle table into a dataframe. The jdbc read method seems to pull an entire table in at once, so I constructed a spark-submit job to work in batch. Whenever I have data I need manipulated I…

tadamhicks · 905 · 1 · 14 · 34
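
If the goal is simply to rerun a batch JDBC pull on an interval, one pattern people reach for is a ConstantInputDStream as a ticking clock: it emits the same dummy RDD every batch, and the real work happens in foreachRDD. A minimal sketch; the Oracle URL and table are placeholders, and sqlContext is assumed to exist:

    import java.util.Properties
    import org.apache.spark.streaming.dstream.ConstantInputDStream

    // A one-element dummy RDD whose only job is to make each batch fire.
    val tick = new ConstantInputDStream(ssc, ssc.sparkContext.parallelize(Seq(1)))

    tick.foreachRDD { _ =>
      // Re-read the Oracle table once per batch interval.
      val df = sqlContext.read.jdbc(
        "jdbc:oracle:thin:@db-host:1521:ORCL",  // placeholder URL
        "MY_SCHEMA.MY_TABLE",                   // placeholder table
        new Properties())
      df.count()  // stand-in for the actual manipulation
    }
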
2 votes · 3 answers
Storing a DataFrame to a Hive partitioned table in Spark
I'm trying to store a stream of data coming in from a Kafka topic into a Hive partitioned table. I was able to convert the DStream to a DataFrame and created a Hive context. My code looks like this
val hiveContext = new…

Riyan Mohammed · 247 · 2 · 6 · 20
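
For the question above, a sketch of the overall Spark 1.x shape: build one HiveContext, convert each batch to a DataFrame, and append with partitionBy. The Record schema, parser, and table name are hypothetical:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext

    case class Record(id: String, payload: String, dt: String)  // hypothetical schema

    def parseRecord(line: String): Record = {                   // hypothetical parser
      val p = line.split(',')
      Record(p(0), p(1), p(2))
    }

    val hiveContext = new HiveContext(ssc.sparkContext)
    import hiveContext.implicits._

    dstream.foreachRDD { rdd =>
      val df = rdd.map(parseRecord).toDF()
      df.write
        .mode(SaveMode.Append)
        .partitionBy("dt")          // partition column of the target table
        .saveAsTable("events")      // hypothetical table name
    }
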
2 votes · 1 answer
Spark Error: invalid log directory /app/spark/spark-1.6.1-bin-hadoop2.6/work/app-20161018015113-0000/3/
My Spark application is failing with the above error.
Actually, my Spark program writes its logs to that directory, and both stderr and stdout are being written on all the workers.
My program used to work fine earlier, but yesterday I changed the…

AKC · 953 · 4 · 17 · 46
2 votes · 1 answer
Why Spark cannot recover from a checkpoint using getOrCreate
Following the official docs, I'm trying to recover the StreamingContext:
def get_or_create_ssc():
    cfg = SparkConf().setAppName('MyApp').setMaster('local[10]')
    sc = SparkContext(conf=cfg)
    ssc = StreamingContext(sparkContext=sc, batchDuration=2)
    …

Zhang Tong · 4,569 · 3 · 19 · 38
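
The usual catch here is that everything needed to rebuild the context, including the checkpoint call and the whole DStream graph, must live inside the factory passed to getOrCreate, which is only invoked when no checkpoint exists yet. The question is PySpark, but the documented pattern is compact in Scala; the checkpoint path is a placeholder:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/myapp"  // placeholder

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("MyApp").setMaster("local[10]")
      val ssc = new StreamingContext(conf, Seconds(2))
      ssc.checkpoint(checkpointDir)  // must be set inside the factory
      // ...define the full DStream graph here before returning...
      ssc
    }

    // Loads from the checkpoint if present; otherwise calls createContext().
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
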