Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Since version 1.3.0, it has supported exactly-once processing semantics, even in the face of failures.
Questions tagged [spark-streaming]
5565 questions
2 votes, 1 answer
MapReduce: How to pass HashMap to mappers
I'm designing the next generation of an analysis system that needs to process many events from many sensors in near real time. To do that, I want to use one of the big-data analytics platforms such as Hadoop, Spark Streaming or Flink.
In order to…

Gal Dreiman (3,969)
2 votes, 0 answers
How to close the SQLContext programmatically?
We have a Spark Streaming job. Inside the DStream foreachRDD method I am creating a SQLContext. The reason I create the SQLContext inside foreachRDD rather than outside is that, when I enable checkpointing, it says the SQLContext is not…

Shankar (8,529)
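The usual workaround for needing a SQLContext inside foreachRDD under checkpointing is a lazily created singleton, so the context is never captured in the checkpoint. A minimal sketch of that lazy-singleton pattern in plain Python; the Spark wiring in the trailing comment is illustrative, not tested here:

```python
# Generic lazy-singleton helper: create an object on first use and
# cache it, so repeated calls (e.g. one per batch) reuse one instance.
_singletons = {}

def get_or_create(key, factory):
    """Return the cached instance for `key`, creating it on first use."""
    if key not in _singletons:
        _singletons[key] = factory()
    return _singletons[key]

# In a PySpark Streaming job this would be used roughly as:
#   def process(rdd):
#       sql_ctx = get_or_create("sql", lambda: SQLContext(rdd.context))
#       ...
#   dstream.foreachRDD(process)
```

Because the factory only runs once, restarting from a checkpoint rebuilds the context on first use instead of deserializing a stale one.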
2 votes, 1 answer
Gradual increase in old-generation heap memory
I am facing a very strange issue in Spark Streaming. I am using Spark 2.0.2 with 3 nodes, 3 executors (1 receiver and 2 processors), 2 GB of memory per executor and 1 core per executor. The batch interval is 10 seconds. My batch size is…

deenbandhu (599)
2 votes, 2 answers
Spark Streaming: TIMESTAMP-field-based processing
I'm fairly new to Spark Streaming and need some basic clarification that I couldn't fully get from the documentation.
The use case is that I have a set of files containing dumped EVENTS, and each event already has a field…

Sokrates (93)
2 votes, 1 answer
Spark Standalone: TransportRequestHandler: Error while invoking RpcHandler when starting workers on different machines/VMs
I am totally new at this, so please pardon any obvious mistakes.
Exact errors:
At slave:
INFO TransportClientFactory: Successfully created connection to /10.2.10.128:7077 after 69 ms (0 ms spent in bootstraps)
WARN Worker: Failed to connect…

Piyush Banginwar (21)
2 votes, 3 answers
saveToCassandra: is there any ordering in which the rows are written?
This is the content of my RDD, which I am saving to a Cassandra table.
It looks like the second row is written first and then the first row overwrites it, so I end up with bad output.
(494bce4f393b474980290b8d1b6ebef9, 2017-02-01, PT0H9M30S,…

shylas (99)
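Spark gives no ordering guarantee across the rows of an RDD, and Cassandra resolves duplicate primary keys by last write wins, so the usual fix is to collapse to one row per key before saving rather than relying on write order. A pure-Python sketch of that collapse; the `key` and `order` selectors are assumptions about which row should survive:

```python
def latest_per_key(rows, key, order):
    """Keep only the row with the highest `order` value for each key,
    so exactly one row per primary key reaches the database."""
    best = {}
    for row in rows:
        k = key(row)
        if k not in best or order(row) > order(best[k]):
            best[k] = row
    return list(best.values())
```

In Spark itself this corresponds to a reduceByKey (keeping the preferred row per key) applied before saveToCassandra.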
2 votes, 1 answer
Spark foreachPartition connection improvements
I have written a Spark job which does the following operations:
Reads data from HDFS text files.
Does a distinct() call to filter duplicates.
Does a mapToPair phase and generates a pairRDD.
Does a reduceByKey call.
Does the aggregation logic for grouped…

Sam (1,333)
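The standard improvement here is to open one connection per partition instead of one per record. A minimal sketch of the per-partition body, assuming a `connection_factory` that returns an object with `write` and `close` methods (hypothetical names):

```python
def write_partition(records, connection_factory):
    """Open one connection for the whole partition (instead of one per
    record), write every record through it, then always close it."""
    conn = connection_factory()
    try:
        for rec in records:
            conn.write(rec)
    finally:
        conn.close()
```

Usage in a job would look roughly like `rdd.foreachPartition(lambda it: write_partition(it, make_conn))`, which bounds connection setup cost to the number of partitions.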
2 votes, 2 answers
How to print a PythonTransformedDStream
I'm trying to run the word-count example integrating an AWS Kinesis stream and Apache Spark. Random lines are put into Kinesis at regular intervals.
lines = KinesisUtils.createStream(...)
When I submit my application and call lines.pprint(), I don't see any values…

ArunDhaJ (621)
2 votes, 1 answer
spark-redshift: save error using Spark 2.1.0
I'm using spark-redshift to load a Kafka stream that gets data events from a MySQL binlog.
When I try to save the RDD into Redshift, an exception is thrown:
command> ./bin/spark-submit --packages…

Carleto (951)
2 votes, 1 answer
Issue putting Spark Streaming data into HBase
I am a beginner in this field, so I can't quite get a sense of it...
HBase version: 0.98.24-hadoop2
Spark version: 2.1.0
The following code tries to put data received from a Spark Streaming Kafka producer into HBase.
The Kafka input data format is like this:…

Chris Joo (577)
2 votes, 0 answers
How to unit-test Spark Streaming code in Java
The JavaStreamingContext.queueStream() Javadoc states:
Changes to the queue after the stream is created will not be recognized
Therefore, unlike in Scala, using a queue for testing window-based scenarios in Java is not an option, because elements…

Daniel Nitzan (1,582)
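One testing strategy that sidesteps queueStream entirely is to factor the per-batch (or per-window) logic into a plain function over ordinary collections and unit-test that, keeping the DStream wiring thin. A sketch of the idea in Python, with word count as a stand-in for the real batch logic:

```python
def count_words(lines):
    """Per-batch logic as a plain function: count words across an
    iterable of lines. Testable without any StreamingContext."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# In the streaming job, the same function would be applied per window,
# e.g. (illustrative):
#   dstream.window(...).foreachRDD(lambda rdd: print(count_words(rdd.collect())))
```

The trade-off: this tests the business logic, not the windowing itself; exercising window semantics still needs an integration test against a real (or manually clocked) streaming context.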
2 votes, 2 answers
Can Spark Streaming and Spark applications be run within the same YARN cluster?
Hello everyone, and happy new year!
I am building a lambda architecture with Apache Spark, HDFS and Elasticsearch.
The following picture shows what I am trying to do:
So far, I have written the source code in Java for my Spark Streaming job and…

Yassir S (1,032)
2 votes, 1 answer
Invoking an external utility inside a Spark Streaming job
I have a streaming job consuming from Kafka (using createDstream).
It is a stream of "id"s:
[id1, id2, id3 ..]
I have a utility (an API) which accepts an array of ids, makes an external call and receives back some info, say "t", for each id…

Rushabh Mehta (83)
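When an external API accepts a whole array of ids, the common pattern is mapPartitions plus chunking, so the utility is called once per batch of ids rather than once per id. A sketch of the chunking helper in plain Python; the chunk size of 100 and the `call_api` name in the comment are illustrative assumptions:

```python
def chunked(iterable, size):
    """Group an iterator's elements into lists of at most `size`."""
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

# Inside mapPartitions, each chunk becomes one call to the utility:
#   def lookup(ids_iter):
#       for batch in chunked(ids_iter, 100):
#           yield from call_api(batch)   # one external call per batch of ids
#   results = dstream.mapPartitions(lookup)
```

This keeps the number of external calls proportional to the number of chunks per partition instead of the number of records.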
2 votes, 2 answers
Scala/Spark: Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
I am pretty new to Scala and Spark and am trying to fix my Spark/Scala development setup. I am confused by the versions and missing jars. I searched Stack Overflow but am still stuck on this issue. Maybe something is missing or misconfigured.
Running…

BAE (8,550)
2 votes, 1 answer
Why do Spark DataFrames not change their schema, and what can be done about it?
I'm using Spark 2.1's Structured Streaming to read from a Kafka topic whose contents are binary Avro-encoded.
Thus, after setting up the DataFrame:
val messages = spark
  .readStream
  .format("kafka")
  .options(kafkaConf)
  .option("subscribe",…

ssice (3,564)