Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.

5565 questions
2
votes
1 answer

MapReduce: How to pass HashMap to mappers

I'm designing the new generation of an analysis system which needs to process many events from many sensors in near-real time. To do that, I want to use one of the Big Data analytics platforms such as Hadoop, Spark Streaming or Flink. In order to…
Gal Dreiman
  • 3,969
  • 2
  • 21
  • 40
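In Spark (as opposed to Hadoop MapReduce's job configuration or distributed cache), the usual way to make a read-only HashMap visible to every task is a broadcast variable. A minimal sketch, with a hypothetical calibration table:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastMapExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-map")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical lookup table built once on the driver.
    val sensorCalibration: Map[String, Double] = Map("s1" -> 0.98, "s2" -> 1.02)

    // broadcast() ships the map once per executor instead of once per task.
    val calibration = sc.broadcast(sensorCalibration)

    val events = sc.parallelize(Seq(("s1", 10.0), ("s2", 20.0)))
    val adjusted = events.map { case (id, value) =>
      (id, value * calibration.value.getOrElse(id, 1.0))
    }
    adjusted.collect().foreach(println)

    spark.stop()
  }
}
```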
2
votes
0 answers

How to close the SQLContext programmatically?

We have a Spark Streaming job; inside the DStream foreachRDD method, I am creating a SQLContext. The reason I am creating the SQLContext inside the foreachRDD method instead of outside is that, when I enable checkpointing, it says SQLContext is not…
Shankar
  • 8,529
  • 26
  • 90
  • 159
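The workaround the Spark Streaming programming guide itself uses for this is a lazily instantiated singleton session, so nothing non-serializable is captured in the checkpoint and the instance is recreated on recovery. A minimal sketch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Lazily instantiated singleton: not serialized into the checkpoint,
// rebuilt on the first batch after recovery.
object SparkSessionSingleton {
  @transient private var instance: SparkSession = _
  def getInstance(conf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession.builder.config(conf).getOrCreate()
    }
    instance
  }
}

// Inside the streaming job (sketch):
// dstream.foreachRDD { rdd =>
//   val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
//   import spark.implicits._
//   rdd.toDF("line").createOrReplaceTempView("lines")
//   spark.sql("SELECT count(*) FROM lines").show()
// }
```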
2
votes
1 answer

Gradual increase in old-generation heap memory

I am facing a very strange issue in Spark Streaming. I am using Spark 2.0.2 with 3 nodes and 3 executors (1 receiver and 2 processors), 2 GB of memory per executor, and 1 core per executor. The batch interval is 10 seconds. My batch size is…
deenbandhu
  • 599
  • 5
  • 18
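For a long-running receiver-based job, a common first diagnostic step is to switch the executors to G1 and log GC activity; a sketch with illustrative settings (a way to observe the growth, not a fix for any specific leak):

```scala
import org.apache.spark.SparkConf

// Minimal sketch of GC-related settings often tried when old-gen usage
// creeps up in a long-running streaming job; flag values are illustrative.
val conf = new SparkConf()
  .setAppName("streaming-gc-tuning")
  // Use G1 on the executors and print GC details to diagnose the growth.
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
```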
2
votes
2 answers

Spark Streaming - TIMESTAMP field based processing

I'm pretty new to Spark Streaming and I need some basic clarification that I couldn't fully get from reading the documentation. The use case is that I have a set of files containing dumped EVENTS, and each event already contains a field…
Sokrates
  • 93
  • 1
  • 11
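With DStreams there is no built-in event-time handling, so the timestamp field has to be parsed and grouped manually; Structured Streaming, by contrast, can window on the event's own column. A sketch under an assumed schema and input path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("event-time").getOrCreate()

// Assumed layout of the dumped events; only eventTime matters here.
val schema = StructType(Seq(
  StructField("sensorId", StringType),
  StructField("eventTime", TimestampType),
  StructField("value", DoubleType)))

val events = spark.readStream
  .schema(schema)
  .json("/path/to/event/dumps")          // assumed input directory

// Group by the event's own timestamp, tolerating 10 minutes of lateness.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("sensorId"))
  .count()
```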
2
votes
1 answer

Spark Standalone: TransportRequestHandler: Error while invoking RpcHandler - when starting workers on different machines/VMs

I am totally new at this, so please pardon any obvious mistakes. Exact errors, at the slave: INFO TransportClientFactory: Successfully created connection to /10.2.10.128:7077 after 69 ms (0 ms spent in bootstraps) WARN Worker: Failed to connect…
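A frequent cause of this symptom is the master binding to an address or hostname the workers cannot reach back on. A sketch of the usual configuration check, with illustrative addresses:

```
# conf/spark-env.sh on the master (illustrative value)
SPARK_MASTER_HOST=10.2.10.128   # bind to an address routable from the workers

# on each worker machine, point the worker at that same address
./sbin/start-slave.sh spark://10.2.10.128:7077
```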
2
votes
3 answers

saveToCassandra: is there any ordering in which the rows are written?

This is the content of my RDD, which I am saving to a Cassandra table. It looks like the 2nd row is written first and then the first row overwrites it, so I end up with bad output. (494bce4f393b474980290b8d1b6ebef9, 2017-02-01, PT0H9M30S,…
shylas
  • 99
  • 4
  • 13
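Within a single saveToCassandra call there is no guaranteed write order; rows sharing a primary key simply overwrite one another (last write wins), so the usual fix is to reduce to one row per key before saving. A sketch, assuming the key columns and a "keep the larger duration" rule:

```scala
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

case class Session(id: String, day: String, duration: String)

// Dedupe to exactly one row per Cassandra primary key before writing;
// keyspace, table, and key columns here are assumptions.
def saveDeduped(rdd: RDD[Session]): Unit =
  rdd
    .keyBy(s => (s.id, s.day))                                     // assumed primary key
    .reduceByKey((a, b) => if (a.duration >= b.duration) a else b) // keep one row per key
    .values
    .saveToCassandra("ks", "sessions", SomeColumns("id", "day", "duration"))
```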
2
votes
1 answer

Spark foreachpartition connection improvements

I have written a Spark job which does the following operations: reads data from HDFS text files, does a distinct() call to filter duplicates, does a mapToPair phase to generate a pairRDD, and does a reduceByKey call with the aggregation logic for grouped…
Sam
  • 1,333
  • 5
  • 23
  • 36
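The standard improvement here is the connection-per-partition pattern from the Spark Streaming guide: open one connection per partition instead of one per record. A sketch with a hypothetical client:

```scala
import org.apache.spark.rdd.RDD

// `Connection`, `createConnection`, and `send` are hypothetical stand-ins
// for whatever client the job actually uses.
trait Connection { def send(record: (String, Long)): Unit; def close(): Unit }
def createConnection(): Connection = ???   // hypothetical factory

def writeOut(pairRdd: RDD[(String, Long)]): Unit =
  pairRdd.foreachPartition { records =>
    val conn = createConnection()          // one connection per partition
    try records.foreach(conn.send)
    finally conn.close()                   // always release it
  }
```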
2
votes
2 answers

How to print PythonTransformedDStream

I'm trying to run a word count example integrating an AWS Kinesis stream and Apache Spark. Random lines are put into Kinesis at regular intervals: lines = KinesisUtils.createStream(...) When I submit my application with lines.pprint(), I don't see any values…
ArunDhaJ
  • 621
  • 6
  • 18
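For comparison, a minimal Scala version of the same pipeline; the app, stream, endpoint, and region names are placeholders. The key point is that print()/pprint() produces no output until the context is started:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-wc"), Seconds(10))

val lines = KinesisUtils.createStream(
  ssc, "my-app", "my-stream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)

lines.map(bytes => new String(bytes)).print() // records arrive as Array[Byte]

ssc.start()            // no output appears before start()
ssc.awaitTermination() // keep the job alive so batches keep processing
```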
2
votes
1 answer

spark-redshift - Error on save using Spark 2.1.0

I'm using spark-redshift to load a Kafka stream that gets data events from a MySQL binlog. When I try to save the RDD into Redshift, an exception is thrown: command> ./bin/spark-submit --packages…
Carleto
  • 951
  • 1
  • 9
  • 17
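For reference, a hedged sketch of a spark-redshift write; the JDBC URL, table name, and S3 tempdir are placeholders, and the credential option shown is only one of the mechanisms the library supports:

```scala
import org.apache.spark.sql.DataFrame

def saveToRedshift(df: DataFrame): Unit =
  df.write
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
    .option("dbtable", "events")
    .option("tempdir", "s3n://bucket/tmp")           // S3 staging area for COPY
    .option("forward_spark_s3_credentials", "true")  // one of the auth options
    .mode("error")
    .save()
```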
2
votes
1 answer

Issue putting Spark Streaming data into HBase

I am a beginner in this field, so I cannot get a sense of it... HBase version: 0.98.24-hadoop2, Spark version: 2.1.0. The following code tries to put data received from a Spark Streaming Kafka producer into HBase. The Kafka input data format is like this:…
Chris Joo
  • 577
  • 10
  • 24
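A common shape for this is one HBase connection per partition inside foreachRDD. The sketch below uses the HBase 1.x client API (0.98 would use HTable and put.add instead); table, column family, and qualifier names are assumptions:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

def writeToHBase(stream: DStream[(String, String)]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One connection per partition, not per record.
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("events"))
      try {
        records.foreach { case (key, value) =>
          val put = new Put(Bytes.toBytes(key))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value))
          table.put(put)
        }
      } finally {
        table.close()
        conn.close()
      }
    }
  }
```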
2
votes
0 answers

How to unit-test Spark Streaming code in Java

The JavaStreamingContext.queueStream() Javadoc states: "Changes to the queue after the stream is created will not be recognized." Therefore, using a queue for testing window-based scenarios in Java is not an option, as opposed to Scala, because elements…
Daniel Nitzan
  • 1,582
  • 3
  • 19
  • 36
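One workaround is to keep the per-batch logic out of the DStream wiring entirely, as a plain RDD => RDD function, so it can be tested without a StreamingContext or queueStream at all; the streaming job then applies it via dstream.transform. A sketch:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Pure batch logic: testable on its own, reusable in the streaming job
// via dstream.transform(WordCountLogic.countWords).
object WordCountLogic {
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
}

// In a test: build RDDs directly and assert on the result.
// val sc  = new SparkContext("local[2]", "test")
// val out = WordCountLogic.countWords(sc.parallelize(Seq("a a b"))).collectAsMap()
// assert(out == Map("a" -> 2, "b" -> 1))
```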
2
votes
2 answers

Can Spark streaming and Spark applications be run within the same YARN cluster?

Hello people and happy new year! I am building a lambda architecture with Apache Spark, HDFS and Elasticsearch. The following picture shows what I am trying to do. So far, I have written the source code in Java for my Spark Streaming and…
Yassir S
  • 1,032
  • 3
  • 21
  • 44
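Yes in principle: YARN can host both, typically by submitting each application to its own scheduler queue so the long-running streaming job cannot starve the batch layer. A minimal sketch with assumed queue names (the same value can be passed as --queue to spark-submit):

```scala
import org.apache.spark.SparkConf

// Long-running streaming (speed-layer) job in its own YARN queue.
val streamingConf = new SparkConf()
  .setAppName("speed-layer")
  .set("spark.yarn.queue", "streaming")

// Periodic batch-layer job in a separate queue.
val batchConf = new SparkConf()
  .setAppName("batch-layer")
  .set("spark.yarn.queue", "batch")
```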
2
votes
1 answer

Invoking an external utility inside a Spark Streaming job

I have a streaming job consuming from Kafka (using createDstream); it's a stream of "id"s [id1, id2, id3, ...]. I have a utility or an API which accepts an array of ids, does some external call, and receives back some info, say "t", for each id…
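A common pattern for this is to batch the ids with mapPartitions so the external API is called once per partition rather than once per id; `lookup` below is a hypothetical stand-in for the utility:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical external API: one "t" back for each id in the batch.
def lookup(ids: Array[String]): Array[String] = ???

def enrich(idStream: DStream[String]): DStream[(String, String)] =
  idStream.mapPartitions { ids =>
    val batch = ids.toArray                      // collect the partition's ids
    batch.iterator.zip(lookup(batch).iterator)   // pair each id with its "t"
  }
```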
2
votes
2 answers

Scala/Spark: Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging

I am pretty new to Scala and Spark and am trying to fix my Spark/Scala development setup. I am confused by the versions and missing jars. I searched on Stack Overflow, but am still stuck on this issue. Maybe something is missing or misconfigured. Running…
BAE
  • 8,550
  • 22
  • 88
  • 171
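org.apache.spark.Logging was removed in Spark 2.0, so this error usually means a 1.x-era artifact (for example spark-streaming-kafka 1.6.x) is mixed into a 2.x build. A build.sbt sketch that keeps every Spark artifact on one version:

```scala
// build.sbt sketch: align all Spark artifacts on the same 2.x version.
scalaVersion := "2.11.8"

val sparkVersion = "2.1.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // 2.x replacement for the old 1.x spark-streaming-kafka artifact:
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVersion
)
```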
2
votes
1 answer

Why do Spark DataFrames not change their schema and what to do about it?

I'm using Spark 2.1's Structured Streaming to read from a Kafka topic whose contents are binary Avro-encoded. Thus, after setting up the DataFrame: val messages = spark .readStream .format("kafka") .options(kafkaConf) .option("subscribe",…
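The Kafka source's schema is fixed (key and value arrive as binary columns), so it never changes to match the payload; the Avro content has to be decoded out of value explicitly. A sketch with a placeholder decoder:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// `decodeAvro` is a placeholder: a real implementation would parse the bytes
// with an Avro DatumReader against the writer schema.
val decodeAvro = udf { (bytes: Array[Byte]) =>
  new String(bytes)   // placeholder decoding only
}

def decoded(messages: DataFrame): DataFrame =
  messages.select(decodeAvro(col("value")).as("event"))
```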