Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.
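In Spark Streaming this model is exposed as the DStream API: a StreamingContext slices the input into one RDD per batch interval, and transformations declared on the DStream are applied to every batch. A minimal Scala sketch (hostname, port, and batch interval are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    // Each batch interval (here 10 seconds) becomes one RDD of the DStream.
    val ssc = new StreamingContext(conf, Seconds(10))

    // A socket text stream; any receiver-based or direct source works the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Deterministic batch operations (flatMap, map, reduceByKey) run on every interval's RDD.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```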

109 questions
0
votes
1 answer

pyspark - error writing dstream to elasticsearch

I am having a problem indexing data from Spark Streaming (pyspark) into Elasticsearch. The data is of type DStream. Below is how it looks: (u'01B', 0) (u'1A5', 1) .... Here's the Elastic index I am using: index=clus and type=data GET…
severine
  • 305
  • 1
  • 3
  • 11
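One common pattern for this kind of question, sketched here in Scala rather than pyspark and assuming the elasticsearch-spark connector is on the classpath, is to write each batch's RDD from inside foreachRDD. The index name "clus/data" comes from the question; the endpoint and documents below are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark._   // adds saveToEs to RDDs (elasticsearch-spark connector)

object DStreamToEsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("DStreamToEs")
      .setMaster("local[2]")
      .set("es.nodes", "localhost:9200")   // assumed Elasticsearch endpoint

    val ssc = new StreamingContext(conf, Seconds(5))
    val docs = ssc.socketTextStream("localhost", 9999)
                  .map(line => Map("id" -> line, "count" -> 0))   // toy documents

    // Write every batch to the index/type from the question.
    docs.foreachRDD { rdd =>
      rdd.saveToEs("clus/data")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```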
0
votes
0 answers

Spark Streaming - How to get results out of foreachRDD function?

I'm trying to read Kafka messages using Spark Streaming, do some computations and send the results to another process. val jsonObject = new JSONObject val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc,…
sen
  • 198
  • 2
  • 9
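A minimal sketch of the usual answer to the question above: foreachRDD runs on the driver once per batch, so small results can be collected there and handed to any external process. The socket source stands in for the Kafka stream in the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ForeachRDDResults").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Any input DStream works here; a socket stream keeps the sketch self-contained.
    val counts = ssc.socketTextStream("localhost", 9999)
                    .flatMap(_.split(" "))
                    .map((_, 1))
                    .reduceByKey(_ + _)

    counts.foreachRDD { rdd =>
      // This closure runs on the driver once per batch; rdd.collect() brings the
      // (small) per-batch result back so it can be sent to another process,
      // e.g. pushed over HTTP, written to a queue, or appended to a local buffer.
      val results = rdd.collect()
      results.foreach { case (word, count) => println(s"$word -> $count") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```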
0
votes
1 answer

Spark Streaming application too slow

While developing a Spark Streaming application (Python), I'm not completely sure I understand how it works. I just have to read a JSON file stream (popping up in a directory) and perform a join operation between each JSON object and a reference, and…
Flibidi
  • 153
  • 2
  • 12
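For the join-per-batch part of the question above, one common structure is transform, which exposes each micro-batch as an RDD so it can be joined against reference data loaded once on the driver. This is a sketch assuming the reference fits in a cached RDD; the paths and CSV format are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamJoinReferenceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamJoinReference").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Reference data loaded once and reused for every batch (assumed key,value CSV).
    val reference = ssc.sparkContext
      .textFile("hdfs:///data/reference.csv")          // placeholder path
      .map { line => val Array(k, v) = line.split(","); (k, v) }
      .cache()

    // Streamed records keyed the same way as the reference (placeholder directory).
    val stream = ssc.textFileStream("hdfs:///data/incoming")
      .map { line => val Array(k, v) = line.split(","); (k, v) }

    // transform exposes each batch as an RDD, so an ordinary RDD join applies.
    val joined = stream.transform(batch => batch.join(reference))
    joined.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```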
0
votes
1 answer

Pyspark - Transfer control out of Spark Session (sc)

This is a follow-up question to Pyspark filter operation on Dstream. To keep a count of how many error/warning messages have come through for, say, a day or an hour, how does one design the job? What I have tried: from __future__ import…
GreenThumb
  • 483
  • 1
  • 7
  • 25
0
votes
1 answer

Pyspark filter operation on Dstream

I have been trying to extend the network word count to filter lines based on a certain keyword. I am using Spark 1.6.2. from __future__ import print_function import sys from pyspark import SparkContext from pyspark.streaming import…
GreenThumb
  • 483
  • 1
  • 7
  • 25
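A minimal sketch of the filtering step asked about above (in Scala here, though the question uses pyspark; the keyword and port are placeholders): DStream.filter applies the predicate to every batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FilterDStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FilterDStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    // Keep only lines containing the keyword; the same filter works in pyspark
    // as lines.filter(lambda line: "ERROR" in line).
    val errors = lines.filter(_.contains("ERROR"))
    errors.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```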
0
votes
0 answers

Spark Streaming Kafka parallel receivers receive data unbalanced

I just want to try to receive stream data from Kafka in parallel. Here is my code: val myKafkaStream = (1 to numReceivers.toInt).map { i => KafkaUtils.createStream(ssc, zkQuorum, group, topicMap) } I run the code on YARN, the numReceivers is…
Stone
  • 1
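A sketch of the usual setup for multiple receivers, assuming the Kafka 0.8 receiver API shown in the question; the ZooKeeper quorum, group, and topic names are placeholders. Each receiver occupies a core, so the common advice is to give the application more cores than receivers and to union the streams before processing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils   // spark-streaming-kafka 0.8 connector

object ParallelReceiversSketch {
  def main(args: Array[String]): Unit = {
    // Enough cores for the receivers plus processing; on YARN size executors accordingly.
    val conf = new SparkConf().setAppName("ParallelKafkaReceivers").setMaster("local[8]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val zkQuorum = "zk1:2181"                 // placeholder ZooKeeper quorum
    val group = "my-consumer-group"           // placeholder consumer group
    val topicMap = Map("my-topic" -> 1)       // placeholder topic -> threads per receiver
    val numReceivers = 4

    // One receiver per stream; each receiver occupies an executor core.
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    }

    // Union them back into a single DStream before further processing.
    val unified = ssc.union(kafkaStreams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```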
0
votes
1 answer

Spark Streaming fault tolerance on DStream batches

Suppose a stream is received at time X and my batch duration is 1 minute. Now my executors are processing the first batch, but this execution takes 3 minutes, until X+3. Meanwhile, at X+1 and X+2 we receive two other batches. Does that mean that at…
Vijay Krishna
  • 1,037
  • 13
  • 19
0
votes
1 answer

Calculating derived value in Spark Streaming

I have two key-value pair DStreams of the type org.apache.spark.streaming.dstream.DStream[Int]. The first key-value pair is (word, frequency). The second key-value pair is (number of rows, value). I would like to divide frequency by value for each word. But, I…
neoguy
  • 61
  • 8
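One way to combine two DStreams batch by batch, as the question above needs, is transformWith. A sketch under stated assumptions: the second stream carries exactly one (rows, value) pair per batch, and the input source below is a placeholder used only to build two example streams.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DivideStreamsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DivideStreams").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)

    // (word, frequency) per batch.
    val wordFreq = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // (number of rows, value) per batch -- here simply (1, line count) as a stand-in.
    val rowValue = lines.count().map(c => (1L, c))

    // transformWith pairs up the two streams' RDDs batch by batch; its function
    // runs on the driver, so a small action on the second RDD is acceptable.
    val divided = wordFreq.transformWith(rowValue,
      (freqRdd: RDD[(String, Int)], valRdd: RDD[(Long, Long)]) => {
        val divisor = valRdd.values.first().toDouble  // assumes exactly one pair per batch
        freqRdd.mapValues(_ / divisor)
      })

    divided.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```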
0
votes
1 answer

How to perform zipping between two DStreams in Scala?

I have two windowed DStreams that I would like to zip, like the normal zipping of RDDs. Note: the main goal is to calculate the mean and standard deviation of the windowed DStream, in case there is a better way to calculate this.
Ahmed Kamal
  • 1,478
  • 14
  • 27
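For the stated goal (mean and standard deviation of a windowed DStream), a sketch that avoids zipping altogether: apply RDD.stats() inside foreachRDD on the windowed stream. The source, parsing, and window durations below are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowStatsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowStats").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Numeric values arriving as text; unparseable lines are silently dropped.
    val values = ssc.socketTextStream("localhost", 9999)
                    .flatMap(line => scala.util.Try(line.trim.toDouble).toOption)

    // A 60-second window recomputed every 10 seconds (placeholder durations).
    val windowed = values.window(Seconds(60), Seconds(10))

    windowed.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // DoubleRDDFunctions.stats() returns a StatCounter with mean and stdev in one pass.
        val stats = rdd.stats()
        println(s"window mean = ${stats.mean}, stdev = ${stats.stdev}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```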
0
votes
0 answers

Spark Streaming: Average over all time

I wrote a Spark Streaming application which receives temperature values and calculates the average temperature over all time. For that I used the JavaPairDStream.updateStateByKey transformation to calculate it per device (separated by the pair's key).…
D. Müller
  • 3,336
  • 4
  • 36
  • 84
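A common shape for this kind of all-time average, sketched in Scala rather than the question's Java and with a placeholder input format and checkpoint directory: keep (sum, count) as the state per device and derive the mean from it.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningAverageSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RunningAverage").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")   // updateStateByKey requires checkpointing

    // Input lines of the form "deviceId temperature" (assumed format).
    val readings = ssc.socketTextStream("localhost", 9999)
      .map { line => val Array(id, t) = line.split(" "); (id, t.toDouble) }

    // State per device: running (sum, count). New values are folded in each batch.
    val state = readings.updateStateByKey[(Double, Long)] {
      (newValues: Seq[Double], current: Option[(Double, Long)]) =>
        val (sum, count) = current.getOrElse((0.0, 0L))
        Some((sum + newValues.sum, count + newValues.size))
    }

    // Derive the all-time average from the state.
    val averages = state.mapValues { case (sum, count) => sum / count }
    averages.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```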
0
votes
0 answers

Spark Streaming - Travis CI and GitHub custom receiver - continuous data but empty RDD?

Lately, as part of a scientific research project, I've been developing an application that streams (or at least should stream) data from Travis CI and GitHub, using their REST APIs. The purpose of this is to get insight into the commit-build relationship, in…
dsafa
  • 783
  • 2
  • 8
  • 29
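Independent of the Travis CI/GitHub specifics above, the custom-receiver contract is small: extend Receiver, start a background thread in onStart, and push records with store(). A generic skeleton, where fetchBatch() is a hypothetical placeholder for the REST calls:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A generic polling receiver; fetchBatch() stands in for the actual REST calls.
class PollingReceiver(pollIntervalMs: Long)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // onStart must not block: spawn a thread and return immediately.
    new Thread("polling-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          // Placeholder for the calls to Travis CI / GitHub.
          val records: Seq[String] = fetchBatch()
          // store() hands data to Spark; if nothing is ever stored, every batch's RDD stays empty.
          records.foreach(store)
          Thread.sleep(pollIntervalMs)
        }
      }
    }.start()
  }

  override def onStop(): Unit = { /* the polling loop exits via isStopped() */ }

  private def fetchBatch(): Seq[String] = Seq.empty   // hypothetical helper
}
```

It would then be wired in with ssc.receiverStream(new PollingReceiver(10000)); a receiver that never reaches store() is one common cause of the "continuous data but empty RDD" symptom.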
0
votes
1 answer

Spark: get multiple DStreams out of a single DStream

Is it possible to get multiple DStreams out of a single DStream in Spark? My use case is as follows: I am getting a stream of log data from an HDFS file. Each log line contains an id (id=xyz). I need to process log lines differently based on the id. So I…
Alok
  • 1,374
  • 3
  • 18
  • 44
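A minimal sketch of the usual approach to the question above: since DStream transformations are lazy, applying several filters to the same parent yields several DStreams that each see only their matching lines. The directory and id values are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SplitDStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SplitDStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Log lines arriving in an HDFS directory (placeholder path).
    val logs = ssc.textFileStream("hdfs:///logs/incoming")

    // Several DStreams derived from one parent, each selecting a different id.
    val xyzStream = logs.filter(_.contains("id=xyz"))
    val abcStream = logs.filter(_.contains("id=abc"))

    // Each derived stream can then be processed differently.
    xyzStream.count().print()
    abcStream.map(_.toUpperCase).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```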
0
votes
2 answers

Apache Spark Scala API: ReduceByKeyAndWindow in Scala

As I'm new to Spark's Scala API, I have the following problem: in my Java code I did a reduceByKeyAndWindow transformation, but now I see that there's only a reduceByWindow (as there's also no PairDStream in Scala). However, I got the first steps in…
D. Müller
  • 3,336
  • 4
  • 36
  • 84
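In Scala the pair-specific methods live on PairDStreamFunctions and are added implicitly to any DStream of tuples, so reduceByKeyAndWindow is called directly on a (key, value) DStream rather than on a separate PairDStream type. A brief sketch with placeholder durations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReduceByKeyAndWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReduceByKeyAndWindow").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")   // needed for the invertible variant below

    // A DStream of tuples is implicitly enriched with the pair operations.
    val pairs = ssc.socketTextStream("localhost", 9999)
                   .flatMap(_.split(" "))
                   .map((_, 1))

    // Counts over a 60-second window, sliding every 10 seconds; the second
    // function "subtracts" the batch leaving the window, which is cheaper.
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      (a: Int, b: Int) => a - b,
      Seconds(60),
      Seconds(10))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```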
0
votes
3 answers

Programmatically creating DStreams in Apache Spark

I am writing some self-contained integration tests around Apache Spark Streaming. I want to test that my code can ingest all kinds of edge cases in my simulated test data. When I was doing this with regular RDDs (not streaming), I could use my…
eshalev
  • 3,033
  • 4
  • 34
  • 48
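One well-known way to feed hand-built data into a streaming test is StreamingContext.queueStream, which turns a queue of ordinary RDDs into a DStream, serving one RDD per batch. A sketch; the example records and the timeout are arbitrary.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamTestSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QueueStreamTest").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Each RDD in the queue is served as one micro-batch, so edge cases
    // can be scripted batch by batch.
    val rddQueue = mutable.Queue[RDD[String]](
      sc.parallelize(Seq("normal record", "another record")),
      sc.parallelize(Seq.empty[String]),          // an empty batch
      sc.parallelize(Seq("malformed,,record"))    // a crafted edge case
    )

    val testStream = ssc.queueStream(rddQueue, oneAtATime = true)
    testStream.count().print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)   // stop after the scripted batches in a test
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```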
0
votes
2 answers

Spark Streaming: how to sum up all results for several DStreams?

I am now using Spark Streaming + Kafka to construct my message processing system, but I have a little technical problem, which I will describe below: for example, I want to do a word count for each 10 minutes. So, in my earliest code, I set Batch…
wuchang
  • 3,003
  • 8
  • 42
  • 66
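A common resolution for this kind of requirement, sketched with placeholder durations and a socket source instead of Kafka: keep a small batch interval for responsiveness and let a tumbling window (window length equal to slide interval) do the 10-minute aggregation, rather than summing several DStreams by hand.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object TumblingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TumblingWordCount").setMaster("local[2]")
    // A short batch interval keeps ingestion responsive...
    val ssc = new StreamingContext(conf, Seconds(30))

    val pairs = ssc.socketTextStream("localhost", 9999)
                   .flatMap(_.split(" "))
                   .map((_, 1))

    // ...while a tumbling 10-minute window emits one combined word count
    // per 10 minutes, with no manual summing of several DStreams.
    val tenMinuteCounts = pairs.reduceByKeyAndWindow(_ + _, Minutes(10), Minutes(10))
    tenMinuteCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```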