Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.
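In Spark Streaming this model is exposed as the DStream API: a StreamingContext slices the input into one RDD per batch interval, and transformations declared on the DStream are applied to every batch. A minimal Scala sketch (hostname, port, and batch interval are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    // Each batch interval (here 10 seconds) becomes one RDD of the DStream.
    val ssc = new StreamingContext(conf, Seconds(10))

    // A socket text stream; any receiver-based or direct source works the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Deterministic batch operations (flatMap, map, reduceByKey) run on every interval's RDD.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```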

109 questions
0
votes
1 answer

pyspark - error writing dstream to elasticsearch

I am having a problem indexing data from Spark Streaming (pyspark) into Elasticsearch. The data is of type DStream. Below is how it looks: (u'01B', 0) (u'1A5', 1) .... Here's the Elastic index I am using: index=clus and type=data GET…
severine
  • 305
  • 1
  • 3
  • 11
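One common pattern for this kind of question, sketched here in Scala rather than pyspark and assuming the elasticsearch-spark connector is on the classpath, is to write each batch's RDD from inside foreachRDD. The index name "clus/data" comes from the question; the endpoint and documents below are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark._   // adds saveToEs to RDDs (elasticsearch-spark connector)

object DStreamToEsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("DStreamToEs")
      .setMaster("local[2]")
      .set("es.nodes", "localhost:9200")   // assumed Elasticsearch endpoint

    val ssc = new StreamingContext(conf, Seconds(5))
    val docs = ssc.socketTextStream("localhost", 9999)
                  .map(line => Map("id" -> line, "count" -> 0))   // toy documents

    // Write every batch to the index/type from the question.
    docs.foreachRDD { rdd =>
      rdd.saveToEs("clus/data")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```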
0
votes
0 answers

Spark Streaming - How to get results out of foreachRDD function?

I'm trying to read Kafka messages using Spark Streaming, do some computations and send the results to another process. val jsonObject = new JSONObject val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc,…
sen
  • 198
  • 2
  • 9
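A minimal sketch of the usual answer to the question above: foreachRDD runs on the driver once per batch, so small results can be collected there and handed to any external process. The socket source stands in for the Kafka stream in the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ForeachRDDResults").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Any input DStream works here; a socket stream keeps the sketch self-contained.
    val counts = ssc.socketTextStream("localhost", 9999)
                    .flatMap(_.split(" "))
                    .map((_, 1))
                    .reduceByKey(_ + _)

    counts.foreachRDD { rdd =>
      // This closure runs on the driver once per batch; rdd.collect() brings the
      // (small) per-batch result back so it can be sent to another process,
      // e.g. pushed over HTTP, written to a queue, or appended to a local buffer.
      val results = rdd.collect()
      results.foreach { case (word, count) => println(s"$word -> $count") }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```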
0
votes
1 answer

Spark Streaming application too slow

While developing a Spark Streaming application (Python), I'm not completely sure I understand how it works. I just have to read a JSON file stream (popping up in a directory) and perform a join operation between each JSON object and a reference, and…
Flibidi
  • 153
  • 2
  • 12
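For the join-per-batch part of the question above, one common structure is transform, which exposes each micro-batch as an RDD so it can be joined against reference data loaded once on the driver. This is a sketch assuming the reference fits in a cached RDD; the paths and CSV format are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamJoinReferenceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamJoinReference").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Reference data loaded once and reused for every batch (assumed key,value CSV).
    val reference = ssc.sparkContext
      .textFile("hdfs:///data/reference.csv")          // placeholder path
      .map { line => val Array(k, v) = line.split(","); (k, v) }
      .cache()

    // Streamed records keyed the same way as the reference (placeholder directory).
    val stream = ssc.textFileStream("hdfs:///data/incoming")
      .map { line => val Array(k, v) = line.split(","); (k, v) }

    // transform exposes each batch as an RDD, so an ordinary RDD join applies.
    val joined = stream.transform(batch => batch.join(reference))
    joined.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```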
0
votes
1 answer

Pyspark - Transfer control out of Spark Session (sc)

This is a follow-up question to Pyspark filter operation on Dstream. To keep a count of how many error/warning messages have come through for, say, a day or an hour, how does one design the job? What I have tried: from __future__ import…
GreenThumb
  • 483
  • 1
  • 7
  • 25
0
votes
1 answer

Pyspark filter operation on Dstream

I have been trying to extend the network word count to filter lines based on a certain keyword. I am using Spark 1.6.2. from __future__ import print_function import sys from pyspark import SparkContext from pyspark.streaming import…
GreenThumb
  • 483
  • 1
  • 7
  • 25
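A minimal sketch of the filtering step asked about above (in Scala here, though the question uses pyspark; the keyword and port are placeholders): DStream.filter applies the predicate to every batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FilterDStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FilterDStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    // Keep only lines containing the keyword; the same filter works in pyspark
    // as lines.filter(lambda line: "ERROR" in line).
    val errors = lines.filter(_.contains("ERROR"))
    errors.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```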
0
votes
0 answers

Spark Streaming Kafka parallel receivers receive data unbalanced

I just want to try to receive stream data from Kafka in parallel. Here is my code: val myKafkaStream = (1 to numReceivers.toInt).map { i => KafkaUtils.createStream(ssc, zkQuorum, group, topicMap) } I run the code on YARN, the numReceivers is…
Stone
  • 1
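A sketch of the usual setup for multiple receivers, assuming the Kafka 0.8 receiver API shown in the question; the ZooKeeper quorum, group, and topic names are placeholders. Each receiver occupies a core, so the common advice is to give the application more cores than receivers and to union the streams before processing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils   // spark-streaming-kafka 0.8 connector

object ParallelReceiversSketch {
  def main(args: Array[String]): Unit = {
    // Enough cores for the receivers plus processing; on YARN size executors accordingly.
    val conf = new SparkConf().setAppName("ParallelKafkaReceivers").setMaster("local[8]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val zkQuorum = "zk1:2181"                 // placeholder ZooKeeper quorum
    val group = "my-consumer-group"           // placeholder consumer group
    val topicMap = Map("my-topic" -> 1)       // placeholder topic -> threads per receiver
    val numReceivers = 4

    // One receiver per stream; each receiver occupies an executor core.
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    }

    // Union them back into a single DStream before further processing.
    val unified = ssc.union(kafkaStreams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```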
0
votes
1 answer

Spark Streaming fault tolerance on DStream batches

Suppose a stream is received at time X and my batch duration is 1 minute. Now my executors are processing the first batch, but this execution takes 3 minutes, until X+3. Meanwhile, at X+1 and X+2 we receive two other batches. Does that mean that at…
Vijay Krishna
  • 1,037
  • 13
  • 19
0
votes
1 answer

Calculating derived value in Spark Streaming

I have two key-value pair DStreams of the type org.apache.spark.streaming.dstream.DStream[Int]. The first key-value pair is (word, frequency). The second key-value pair is (number of rows, value). I would like to divide frequency by value for each word. But, I…
neoguy
  • 61
  • 8
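One way to combine two DStreams batch by batch, as the question above needs, is transformWith. A sketch under stated assumptions: the second stream carries exactly one (rows, value) pair per batch, and the input source below is a placeholder used only to build two example streams.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DivideStreamsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DivideStreams").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)

    // (word, frequency) per batch.
    val wordFreq = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // (number of rows, value) per batch -- here simply (1, line count) as a stand-in.
    val rowValue = lines.count().map(c => (1L, c))

    // transformWith pairs up the two streams' RDDs batch by batch; its function
    // runs on the driver, so a small action on the second RDD is acceptable.
    val divided = wordFreq.transformWith(rowValue,
      (freqRdd: RDD[(String, Int)], valRdd: RDD[(Long, Long)]) => {
        val divisor = valRdd.values.first().toDouble  // assumes exactly one pair per batch
        freqRdd.mapValues(_ / divisor)
      })

    divided.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```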
0
votes
1 answer

How to perform zipping between two DStreams in Scala?

I have two windowed DStreams that I would like to zip, like the normal zipping of RDDs. Note: the main goal is to calculate the mean and standard deviation of the windowed DStream, in case there is a better way to calculate this.
Ahmed Kamal
  • 1,478
  • 14
  • 27
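For the stated goal (mean and standard deviation of a windowed DStream), a sketch that avoids zipping altogether: apply RDD.stats() inside foreachRDD on the windowed stream. The source, parsing, and window durations below are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowStatsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowStats").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Numeric values arriving as text; unparseable lines are silently dropped.
    val values = ssc.socketTextStream("localhost", 9999)
                    .flatMap(line => scala.util.Try(line.trim.toDouble).toOption)

    // A 60-second window recomputed every 10 seconds (placeholder durations).
    val windowed = values.window(Seconds(60), Seconds(10))

    windowed.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // DoubleRDDFunctions.stats() returns a StatCounter with mean and stdev in one pass.
        val stats = rdd.stats()
        println(s"window mean = ${stats.mean}, stdev = ${stats.stdev}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```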
0
votes
0 answers

Spark Streaming: Average over all time

I wrote a Spark Streaming application which receives temperature values and calculates the average temperature over all time. For that I used the JavaPairDStream.updateStateByKey transformation to calculate it per device (separated by the pair's key).…
D. Müller
  • 3,336
  • 4
  • 36
  • 84
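A common shape for this kind of all-time average, sketched in Scala rather than the question's Java and with a placeholder input format and checkpoint directory: keep (sum, count) as the state per device and derive the mean from it.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningAverageSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RunningAverage").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")   // updateStateByKey requires checkpointing

    // Input lines of the form "deviceId temperature" (assumed format).
    val readings = ssc.socketTextStream("localhost", 9999)
      .map { line => val Array(id, t) = line.split(" "); (id, t.toDouble) }

    // State per device: running (sum, count). New values are folded in each batch.
    val state = readings.updateStateByKey[(Double, Long)] {
      (newValues: Seq[Double], current: Option[(Double, Long)]) =>
        val (sum, count) = current.getOrElse((0.0, 0L))
        Some((sum + newValues.sum, count + newValues.size))
    }

    // Derive the all-time average from the state.
    val averages = state.mapValues { case (sum, count) => sum / count }
    averages.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```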
0
votes
0 answers

Spark Streaming - Travis CI and GitHub custom receiver - continuous data but empty RDD?

Lately, as part of a scientific research project, I've been developing an application that streams (or at least should stream) data from Travis CI and GitHub, using their REST APIs. The purpose of this is to get insight into the commit-build relationship, in…
dsafa
  • 783
  • 2
  • 8
  • 29
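Independent of the Travis CI/GitHub specifics above, the custom-receiver contract is small: extend Receiver, start a background thread in onStart, and push records with store(). A generic skeleton, where fetchBatch() is a hypothetical placeholder for the REST calls:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A generic polling receiver; fetchBatch() stands in for the actual REST calls.
class PollingReceiver(pollIntervalMs: Long)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // onStart must not block: spawn a thread and return immediately.
    new Thread("polling-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          // Placeholder for the calls to Travis CI / GitHub.
          val records: Seq[String] = fetchBatch()
          // store() hands data to Spark; if nothing is ever stored, every batch's RDD stays empty.
          records.foreach(store)
          Thread.sleep(pollIntervalMs)
        }
      }
    }.start()
  }

  override def onStop(): Unit = { /* the polling loop exits via isStopped() */ }

  private def fetchBatch(): Seq[String] = Seq.empty   // hypothetical helper
}
```

It would then be wired in with ssc.receiverStream(new PollingReceiver(10000)); a receiver that never reaches store() is one common cause of the "continuous data but empty RDD" symptom.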
0
votes
1 answer

Spark: get multiple DStreams out of a single DStream

Is it possible to get multiple DStreams out of a single DStream in Spark? My use case is as follows: I am getting a stream of log data from an HDFS file. Each log line contains an id (id=xyz). I need to process log lines differently based on the id. So I…
Alok
  • 1,374
  • 3
  • 18
  • 44
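A minimal sketch of the usual approach to the question above: since DStream transformations are lazy, applying several filters to the same parent yields several DStreams that each see only their matching lines. The directory and id values are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SplitDStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SplitDStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Log lines arriving in an HDFS directory (placeholder path).
    val logs = ssc.textFileStream("hdfs:///logs/incoming")

    // Several DStreams derived from one parent, each selecting a different id.
    val xyzStream = logs.filter(_.contains("id=xyz"))
    val abcStream = logs.filter(_.contains("id=abc"))

    // Each derived stream can then be processed differently.
    xyzStream.count().print()
    abcStream.map(_.toUpperCase).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```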
0
votes
2 answers

Apache Spark Scala API: ReduceByKeyAndWindow in Scala

As I'm new to Spark's Scala API, I have the following problem: in my Java code I did a reduceByKeyAndWindow transformation, but now I see that there's only a reduceByWindow (as there's also no PairDStream in Scala). However, I got the first steps in…
D. Müller
  • 3,336
  • 4
  • 36
  • 84
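In Scala the pair-specific methods live on PairDStreamFunctions and are added implicitly to any DStream of tuples, so reduceByKeyAndWindow is called directly on a (key, value) DStream rather than on a separate PairDStream type. A brief sketch with placeholder durations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReduceByKeyAndWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReduceByKeyAndWindow").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")   // needed for the invertible variant below

    // A DStream of tuples is implicitly enriched with the pair operations.
    val pairs = ssc.socketTextStream("localhost", 9999)
                   .flatMap(_.split(" "))
                   .map((_, 1))

    // Counts over a 60-second window, sliding every 10 seconds; the second
    // function "subtracts" the batch leaving the window, which is cheaper.
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      (a: Int, b: Int) => a - b,
      Seconds(60),
      Seconds(10))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```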
0
votes
3 answers

Programmatically creating DStreams in Apache Spark

I am writing some self-contained integration tests around Apache Spark Streaming. I want to test that my code can ingest all kinds of edge cases in my simulated test data. When I was doing this with regular RDDs (not streaming), I could use my…
eshalev
  • 3,033
  • 4
  • 34
  • 48
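One well-known way to feed hand-built data into a streaming test is StreamingContext.queueStream, which turns a queue of ordinary RDDs into a DStream, serving one RDD per batch. A sketch; the example records and the timeout are arbitrary.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamTestSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QueueStreamTest").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Each RDD in the queue is served as one micro-batch, so edge cases
    // can be scripted batch by batch.
    val rddQueue = mutable.Queue[RDD[String]](
      sc.parallelize(Seq("normal record", "another record")),
      sc.parallelize(Seq.empty[String]),          // an empty batch
      sc.parallelize(Seq("malformed,,record"))    // a crafted edge case
    )

    val testStream = ssc.queueStream(rddQueue, oneAtATime = true)
    testStream.count().print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)   // stop after the scripted batches in a test
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```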
0
votes
2 answers

Spark Streaming: how to sum up all results for several DStreams?

I am now using Spark Streaming + Kafka to construct my message processing system, but I have a little technical problem, which I will describe below: for example, I want to do a word count for each 10 minutes. So, in my earliest code, I set Batch…
wuchang
  • 3,003
  • 8
  • 42
  • 66
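A common resolution for this kind of requirement, sketched with placeholder durations and a socket source instead of Kafka: keep a small batch interval for responsiveness and let a tumbling window (window length equal to slide interval) do the 10-minute aggregation, rather than summing several DStreams by hand.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object TumblingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TumblingWordCount").setMaster("local[2]")
    // A short batch interval keeps ingestion responsive...
    val ssc = new StreamingContext(conf, Seconds(30))

    val pairs = ssc.socketTextStream("localhost", 9999)
                   .flatMap(_.split(" "))
                   .map((_, 1))

    // ...while a tumbling 10-minute window emits one combined word count
    // per 10 minutes, with no manual summing of several DStreams.
    val tenMinuteCounts = pairs.reduceByKeyAndWindow(_ + _, Minutes(10), Minutes(10))
    tenMinuteCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```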