Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.

5565 questions
2
votes
1 answer

Trying to understand how Spark Streaming works?

This might be a stupid question, but I can't seem to find any doc clarifying this in plain English (ok, exaggerated), and after reading the official doc and some blogs, I'm still confused about how the driver and executors work. Here is my current…
avocado
  • 2,615
  • 3
  • 24
  • 43
2
votes
1 answer

H2O Spark streaming 2.1 distribution

I have been intermittently getting a distribution error when running a sample IRIS model in Sparkling Water. Sparkling Water: 2.1; Spark streaming kafka: 0.10.0.0; running locally using spark-submit, master only. DistributedException from xxx:54321,…
Lalit Agarwal
  • 2,354
  • 1
  • 14
  • 18
2
votes
1 answer

Stateful streaming Spark processing

I'm learning Spark and trying to build a simple streaming service. For example, I have a Kafka queue and a Spark job like word count. That example uses a stateless mode. I'd like to accumulate word counts, so if test has been sent a few times in…
kikulikov
  • 2,512
  • 4
  • 29
  • 45
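In Spark Streaming, accumulating counts across micro-batches is what stateful operators like `updateStateByKey` (or `mapWithState`) provide: each batch's new values are merged into a running state per key. Outside Spark, the core merge logic can be sketched in plain Python (the batches and words here are made-up sample data, not from the question):

```python
from collections import Counter

def update_state(state, batch_words):
    """Merge one micro-batch of words into the running counts,
    mimicking what updateStateByKey's update function does per key."""
    merged = Counter(state)
    merged.update(batch_words)
    return dict(merged)

# Simulate three micro-batches arriving from the Kafka queue.
state = {}
for batch in [["test", "spark"], ["test"], ["test", "kafka"]]:
    state = update_state(state, batch)

print(state)  # counts accumulate across batches: {'test': 3, 'spark': 1, 'kafka': 1}
```

With real Spark Streaming the same merge function would be passed to `updateStateByKey`, and checkpointing must be enabled so the state survives failures.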
2
votes
1 answer

How to calculate z-score with the DataFrame API in Apache Spark structured streaming?

I'm currently struggling with the following: the z-score is defined as z = (x - u) / sd (where x is the individual value, u the mean of the window, and sd the standard deviation of the window). I can calculate u and sd on the window but don't know how to…
Romeo Kienzler
  • 3,373
  • 3
  • 36
  • 58
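The arithmetic the question describes is straightforward once u and sd are available: each value in the window is shifted by the mean and scaled by the standard deviation. In Spark one would compute u and sd as window aggregates and derive the column from them; the plain-Python sketch below shows just the per-window math (`window_zscores` is a hypothetical helper, and population standard deviation is assumed):

```python
import statistics

def window_zscores(window):
    """Compute z = (x - u) / sd for every value x in a window,
    using the window's own mean and population standard deviation."""
    u = statistics.mean(window)
    sd = statistics.pstdev(window)
    return [(x - u) / sd for x in window]

scores = window_zscores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Whether to use the population (`pstdev`) or sample (`stdev`) standard deviation depends on the use case; Spark's `stddev` defaults to the sample version (`stddev_samp`), so the two can differ slightly.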
2
votes
2 answers

How to spark-submit a Spark Streaming application with spark-streaming-kafka-0-8 dependency?

I am trying to run the Spark Streaming example DirectKafkaWordCount.scala. To create the jar I am using build.sbt with the plugin: name := "Kafka Direct" version := "1.0" scalaVersion := "2.11.6" libraryDependencies ++= Seq ("org.apache.spark" %…
Angshusuri
  • 57
  • 3
  • 10
2
votes
2 answers

How can I write the results of a JavaPairDStream to an output Kafka topic in Spark Streaming?

I'm looking for a way to write a DStream to an output Kafka topic, only when the micro-batch RDDs actually spit out something. I'm using Spark Streaming and the spark-streaming-kafka connector in Java 8 (both latest versions), but I cannot figure it out. Thanks for the…
Aniello Guarino
  • 197
  • 2
  • 10
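The usual pattern for this is `foreachRDD` with an emptiness check (`rdd.isEmpty()`) before creating a producer and sending, so that empty micro-batches trigger no output. The gating logic, stripped of Spark and Kafka specifics, can be sketched in plain Python (`forward_nonempty_batches` and `send` are hypothetical stand-ins, not real connector APIs):

```python
def forward_nonempty_batches(batches, send):
    """Mimic foreachRDD gating: forward only micro-batches that
    actually produced records, skipping empty ones entirely."""
    forwarded = []
    for batch in batches:
        if not batch:           # analogue of rdd.isEmpty()
            continue            # no producer work for empty batches
        for record in batch:
            send(record)
        forwarded.append(batch)
    return forwarded

out = []
forward_nonempty_batches([["a", "b"], [], ["c"]], out.append)
print(out)  # only records from non-empty batches: ['a', 'b', 'c']
```

In the real Java version, the producer should be created (or fetched from a pool) inside the `foreachPartition` closure on the executors, not on the driver, since producers are not serializable.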
2
votes
1 answer

Pyspark dataframe: split JSON column values into multiple top-level columns

I have a JSON column which can contain any number of key:value pairs. I want to create new top-level columns for these key:value pairs. For example, if I have this data: A B "{\"C\":\"c\" , \"D\":\"d\"...}" b This…
gashu
  • 863
  • 2
  • 10
  • 21
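The reshaping being asked for is: parse the JSON string in each row and promote its keys to sibling columns (in PySpark this is typically done with `from_json` plus a schema, or `get_json_object` per key). The plain-Python equivalent of that reshaping, on made-up sample rows, looks like this (`explode_json_column` is a hypothetical helper):

```python
import json

def explode_json_column(rows, json_field):
    """Promote the key:value pairs inside a JSON-string column to
    top-level fields, dropping the original JSON column."""
    out = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != json_field}
        flat.update(json.loads(row[json_field]))  # keys become top-level fields
        out.append(flat)
    return out

rows = [{"A": "a1", "B": '{"C": "c", "D": "d"}'}]
print(explode_json_column(rows, "B"))  # [{'A': 'a1', 'C': 'c', 'D': 'd'}]
```

Note the caveat that makes the Spark version harder: when the set of keys varies per row, the schema for `from_json` must cover the union of all keys, or the JSON must first be read as a `MapType` and the keys pivoted out.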
2
votes
1 answer

Spark 2.1 Structured Streaming - Using Kafka as source with Python (pyspark)

With Apache Spark version 2.1, I would like to use Kafka (0.10.0.2.5) as a source for Structured Streaming with pyspark: kafka_app.py: from pyspark.sql import…
JS G.
  • 158
  • 1
  • 9
2
votes
1 answer

Spark Streaming Dynamic Allocation ExecutorAllocationManager

We have a Spark 2.1 streaming application with mapWithState, enabling spark.streaming.dynamicAllocation.enabled=true. The pipeline is as follows: var rdd_out = ssc.textFileStream() .map(convertToEvent(_)) .combineByKey(...., new…
Joe Bledo
  • 21
  • 2
2
votes
1 answer

How to include a GROUP BY alias column in the DataFrame SELECT list

I am doing SUM on multiple columns, and I want to include those columns in the SELECT list. Below is my work: val df=df0 .join(df1, df1("Col1")<=>df0("Col1")) .filter((df1("Colum")==="00") …
sks
  • 169
  • 4
  • 15
2
votes
2 answers

Spark Streaming - Count distinct element in state

I have a DStream with key-value pairs of VideoID-UserID. What is a good practice for counting distinct UserIDs grouped by VideoID? // VideoID,UserID foo,1 foo,2 bar,1 bar,2 foo,1 bar,2 As above, I want to get VideoID-CountUserID by removing…
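The core computation is: group the pairs by VideoID, deduplicate the UserIDs within each group, then count. In Spark this maps to something like mapping each pair to a singleton set and reducing by key with set union (or `countApproxDistinctByKey` when an approximation suffices). A plain-Python sketch of the exact version, using the question's sample pairs (`distinct_users_per_video` is a hypothetical helper):

```python
from collections import defaultdict

def distinct_users_per_video(pairs):
    """Count distinct UserIDs per VideoID by collecting a user set
    per key, the same idea as reduceByKey over sets in Spark."""
    seen = defaultdict(set)
    for video, user in pairs:
        seen[video].add(user)          # set membership deduplicates repeats
    return {video: len(users) for video, users in seen.items()}

pairs = [("foo", 1), ("foo", 2), ("bar", 1), ("bar", 2), ("foo", 1), ("bar", 2)]
print(distinct_users_per_video(pairs))  # {'foo': 2, 'bar': 2}
```

On a real stream the per-key sets can grow without bound, which is why approximate structures such as HyperLogLog (what `countApproxDistinct` uses) are often preferred for long-running jobs.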
2
votes
1 answer

Spark Java: How to move data from HTTP source to Couchbase sink?

I have a .gz file available on a web server that I want to consume in a streaming manner and insert the data into Couchbase. The .gz file contains only one file, which in turn contains one JSON object per line. Since Spark doesn't have an HTTP…
Abhijit Sarkar
  • 21,927
  • 20
  • 110
  • 219
2
votes
3 answers

Spark + Kafka streaming NoClassDefFoundError kafka/serializer/StringDecoder

I'm trying to send messages from my Kafka producer and stream them in Spark Streaming. But I'm getting the following error when I run my application with spark-submit. Error Exception in thread "main" java.lang.NoClassDefFoundError:…
Gaurav Ram
  • 1,085
  • 3
  • 16
  • 32
2
votes
1 answer

Exception in thread "main" java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheLoader

When I am trying to execute my Kafka Spark project, I get the error below: Exception in thread "main" java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheLoader at…
2
votes
1 answer

Application hangs when I do join for PipelinedRDD and RDD from DStream

I use Spark 1.6.0 with Spark Streaming and have a problem with wide operations. Code example: there is an RDD called "a" of type 'pyspark.rdd.PipelinedRDD'. "a" was received as: # Load a text file and convert each line to a Row. …