Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Since version 1.3.0, it has supported exactly-once processing semantics, even in the face of failures.

5565 questions
2
votes
1 answer

How to load JSON(path saved in csv) with Spark?

I am new to Spark. I can load a single .json file in Spark, but what if there are thousands of .json files in a folder? I also have a CSV file, which classifies the .json files with labels. What should I…
Fengyu
  • 35
  • 2
  • 6
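A minimal sketch of one way to approach this, assuming the CSV has columns named `path` and `label` (both names are illustrative, not from the question): group the JSON paths by label, then hand each group of paths to Spark in one call, since `spark.read.json` accepts a list of paths.

```python
import csv
import io

# Stand-in for the asker's CSV: each row maps a .json file path to a label.
# Column names "path" and "label" are assumptions for this sketch.
csv_text = "path,label\ndata/a.json,cat\ndata/b.json,dog\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
paths_by_label = {}
for r in rows:
    paths_by_label.setdefault(r["label"], []).append(r["path"])

# With a SparkSession in scope, each label's files can then be loaded in
# one DataFrame, because spark.read.json accepts a list of paths:
#   df = spark.read.json(paths_by_label["cat"])
```

For very large path lists, another option is to load all files at once with a glob (`spark.read.json("folder/*.json")` plus `input_file_name()`) and join against the CSV inside Spark instead of on the driver.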
2
votes
1 answer

How to figure out if DStream is empty

I have two inputs, where the first input is a stream (say input1) and the second one is a batch (say input2). I want to figure out whether each key in the first input matches a single row or more than one row in the second input. The further transformations/logic…
Dazzler
  • 807
  • 9
  • 11
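On the emptiness part of this question, the usual check is `rdd.isEmpty()` inside `foreachRDD`. For the single-versus-multiple match, the core logic is to count rows per key in input2 and then look each stream key up in those counts (in Spark this would be a `countByKey`/`reduceByKey` on input2 followed by a join). A plain-Python sketch of that logic, with stand-in data and names:

```python
from collections import Counter

# Stand-in data: keys from the streaming batch (input1) and keyed rows of
# the batch dataset (input2). Names and values are illustrative only.
stream_keys = ["a", "b", "c"]
batch_rows = [("a", 1), ("a", 2), ("b", 3)]

# Count rows per key in input2 (in Spark: input2.countByKey(), or a
# reduceByKey producing (key, count), joined against the stream's keys).
counts = Counter(k for k, _ in batch_rows)
match_kind = {
    k: "multi" if counts[k] > 1 else ("single" if counts[k] == 1 else "none")
    for k in stream_keys
}
```

The same classification then drives the downstream branching per key.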
2
votes
1 answer

How to map key/value partitions in parallel in Spark Streaming

I have a Spark Streaming program running in local mode in which I receive JSON messages from a TCP socket connection, several per batch interval. Each of these messages has an ID, which I use to create a key/value JavaPairDStream, such that in each…
manuel mourato
  • 801
  • 1
  • 12
  • 36
2
votes
2 answers

Multiple consumers exactly-once processing with Apache Spark Streaming

I am looking to process elements on a queue (Kafka or Amazon Kinesis) and to perform multiple operations on each element, for example: write it to an HDFS cluster, invoke a REST API, trigger a notification on Slack. On each of these…
Edmondo
  • 19,559
  • 13
  • 62
  • 115
2
votes
1 answer

Execute multiple actions parallel/async in Spark Streaming

Is there a way to execute multiple actions asynchronously/in parallel in Spark Streaming? Here is my code: positions.foreachRDD(rdd -> { JavaRDD pbv = rdd.map(p -> A.create(p)); javaFunctions(pbv).writerBuilder("poc",…
mananana
  • 393
  • 3
  • 15
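Spark actions block the calling driver thread, so one common pattern is to submit each action to a thread pool from inside `foreachRDD`; with `spark.scheduler.mode=FAIR` the resulting jobs can then run concurrently. A sketch of the driver-side pattern, with stand-in functions in place of the asker's Cassandra writer:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for two blocking Spark actions on the same (cached) RDD,
# e.g. a saveToCassandra and a count. Names are illustrative only.
def write_to_sink():
    return "written"

def compute_count():
    return "counted"

# Submitting both actions to a pool lets them run as concurrent Spark
# jobs instead of sequentially; .result() blocks until each finishes,
# which keeps the batch from completing before the actions do.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(write_to_sink)
    f2 = pool.submit(compute_count)
    results = [f1.result(), f2.result()]
```

When the same RDD feeds both actions, caching it first (`rdd.cache()`) avoids recomputing the lineage once per action.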
2
votes
0 answers

spark streaming application - deployment best practices

I am using spark-submit cluster-mode deployment for my application to run it in production. But this requires having the jars at the same path on all the nodes, and also the config file, which is passed as an argument, at the same path. I…
Knight71
  • 2,927
  • 5
  • 37
  • 63
2
votes
2 answers

How to add jar using HiveContext in the spark job

I am trying to add the JSONSerDe jar file in order to access the JSON data and load it into a Hive table from the Spark job. My code is shown below: SparkConf sparkConf = new SparkConf().setAppName("KafkaStreamToHbase"); JavaSparkContext…
Bhaskar
  • 271
  • 7
  • 20
2
votes
1 answer

Dstream Runtime Creation/Destruction

Can DStreams with new names be created and older DStreams destroyed at runtime? //Read the Dstream inputDstream = ssc.textFileStream("./myPath/") Example: I am reading a file called cvd_filter.txt in which every single line contains a string…
vkb
  • 458
  • 1
  • 7
  • 18
2
votes
0 answers

Could not read until the end sequence number of the range

I have a Spark Streaming application which reads data from Kinesis, processes it, and sends the result to Elasticsearch. It was working fine, but it suddenly started throwing the following error while reading data from…
SMN
  • 46
  • 5
2
votes
0 answers

Spark Streaming variable in UpdateStateByKey not changing value after restarting application from checkpoint

I'm currently working in Python building a moderately complex application that relies on stateful data from multiple sources. With Pyspark I've run into an issue where a global variable used within an updateStateByKey function isn't being assigned…
JoeP
  • 21
  • 3
2
votes
1 answer

Spark Streaming -> DStream.checkpoint versus SparkStreaming.checkpoint

I have a Spark 1.4 Streaming application, which reads data from Kafka, uses stateful transformations, and has a batch interval of 15 seconds. In order to use stateful transformations, as well as recover from driver failures, I need to set checkpointing…
Srdjan Nikitovic
  • 853
  • 2
  • 9
  • 19
2
votes
1 answer

Caching DStream in Spark Streaming

I have a Spark Streaming process which reads data from Kafka into a DStream. In my pipeline I call DStream.foreachRDD twice (one after another), with transformations on the RDD and inserting into a destination. (Each time I do different processing and…
2
votes
0 answers

Spark akka throws a java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc

I have built Spark using Scala 2.11. I ran the following steps: ./dev/change-scala-version.sh 2.11 mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package After building Spark successfully, I tried to initialize Spark via the Akka model. So,…
Raveesh Sharma
  • 1,486
  • 5
  • 21
  • 38
2
votes
0 answers

Spark streaming reliable receiver and BlockGenerator?

As I understand it, when implementing a reliable receiver for Spark Streaming, block generation needs to be taken care of in the custom receiver. Is this as easy as collecting some events into some kind of queue and then storing the iterator? Or…
Sunny
  • 605
  • 10
  • 35
2
votes
0 answers

Spark Streaming Task Distribution

I have one spark streaming program that uses updateStateByKey. When I run it on a cluster with 3 machines, all updateStateByKey tasks (these are heavy tasks) run on one machine. This results in a scheduling delay on inputs, while other machines have…
Majid Hajibaba
  • 3,105
  • 6
  • 23
  • 55