Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.

109 questions
1 vote • 1 answer

Perform Multiple Transformations on a DStream

I am fairly new to Spark Streaming. I have streaming data containing two values, x and y, for example: 1 300, 2 8754, 3 287, etc. Out of the streamed data, I want to get the smallest y value, the largest y value, and the mean of the x values. This needs to be…
Tsume • 907 • 2 • 11 • 21
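
For a question like the one above, one single-pass approach is to fold each microbatch into a (min, max, sum, count) tuple. A minimal sketch, assuming the input arrives as "x y" lines on a socket; the host, port, and batch interval are illustrative, not from the question:

```scala
// A sketch, not the asker's code: track (min y, max y, sum x, count) in one pass.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamStats {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("StreamStats").setMaster("local[2]"), Seconds(5))

    // Hypothetical source: "x y" lines, e.g. "1 300", arriving on a socket.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(_.split("\\s+"))
      .map(a => (a(0).toDouble, a(1).toDouble))

    pairs.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Fold each microbatch into (minY, maxY, sumX, count).
        val (minY, maxY, sumX, n) = rdd
          .map { case (x, y) => (y, y, x, 1L) }
          .reduce { case ((a1, b1, c1, d1), (a2, b2, c2, d2)) =>
            (math.min(a1, a2), math.max(b1, b2), c1 + c2, d1 + d2)
          }
        println(s"min y = $minY, max y = $maxY, mean x = ${sumX / n}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```
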
1 vote • 3 answers

Merge Spark DStream with a variable to saveToCassandra()

I have a DStream[String, Int] with pairs of word counts, e.g. ("hello" -> 10). I want to write these counts to Cassandra with a step index. The index is initialized as var step = 1 and is incremented with each microbatch processed. The Cassandra…
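
A minimal sketch of the step-index pattern, assuming the spark-cassandra-connector is available; the keyspace, table, and column names are hypothetical. Incrementing the driver-side var inside foreachRDD works because that block runs on the driver once per microbatch:

```scala
// A sketch: tag each microbatch's records with a step index before saving.
import com.datastax.spark.connector._ // spark-cassandra-connector
import org.apache.spark.streaming.dstream.DStream

object StepWriter {
  var step = 1 // driver-side counter, bumped once per microbatch

  def saveWithStep(wordCounts: DStream[(String, Int)]): Unit =
    wordCounts.foreachRDD { rdd =>
      val currentStep = step // capture into a local val for the executor closure
      rdd.map { case (word, count) => (word, count, currentStep) }
        .saveToCassandra("ks", "word_counts", SomeColumns("word", "count", "step"))
      step += 1
    }
}
```
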
1 vote • 0 answers

Spark Streaming: From DStream to pandas DataFrame

In the snippet below I try to transform a DStream of temperatures (received from Kafka) into a pandas Dataframe. def main_process(time, dStream): print("========= %s =========" % str(time)) try: # Get the singleton instance of SparkSession …
HappyCane • 363 • 1 • 2 • 10
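
The question's snippet is PySpark; below is the same foreachRDD-to-DataFrame pattern sketched in Scala for consistency with the other examples here (in PySpark the final step would be df.toPandas()). The stream's element type and the column name are assumptions:

```scala
// A Scala sketch of the singleton-SparkSession pattern inside foreachRDD.
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

def toDataFrames(temperatures: DStream[Double]): Unit =
  temperatures.foreachRDD { (rdd, time) =>
    println(s"========= $time =========")
    // getOrCreate reuses one SparkSession across batches.
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    val df = rdd.toDF("temperature") // in PySpark, df.toPandas() from here
    df.show()
  }
```
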
1 vote • 0 answers

Spark JSON DStream Print() / saveAsTextFiles not working

Issue description: Spark version 1.6.2; execution: spark-shell (REPL); master = local[2] (also tried local[*]). example.json is as below: {"name":"D2" ,"lovesPandas":"Y"} {"name":"D3" ,"lovesPandas":"Y"} {"name":"D4" ,"lovesPandas":"Y"} {"name":"D5"…
RGuy • 11 • 2
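
A common cause here is that textFileStream only picks up files created in the monitored directory after the context starts. A minimal spark-shell-style sketch; the paths and batch interval are illustrative:

```scala
// spark-shell style; sc is the shell's SparkContext.
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Only files moved into this directory *after* ssc.start() are picked up.
val lines = ssc.textFileStream("/tmp/stream-in")
val pandaLovers = lines.filter(_.contains("\"lovesPandas\":\"Y\""))

pandaLovers.print()                                 // sample of each batch on the driver
pandaLovers.saveAsTextFiles("/tmp/stream-out/part") // one directory per batch

ssc.start()
ssc.awaitTermination()
```
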
1 vote • 2 answers

Reading data from HBase through Spark Streaming

So my project flow is Kafka -> Spark Streaming -> HBase. Now I want to read the data back from HBase, going over the table created by the previous job, do some aggregation, and store it in another table in a different column format: Kafka -> Spark…
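
A minimal sketch of the read-back step, assuming the HBase client and TableInputFormat are on the classpath; the table name is hypothetical and sc is an existing SparkContext (e.g. ssc.sparkContext in the streaming job):

```scala
// A sketch of reading an HBase table back as an RDD; table name is hypothetical.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "table_from_previous_job")

val hbaseRdd = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"rows read back from HBase: ${hbaseRdd.count()}") // aggregate from here
```
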
1 vote • 1 answer

How to get the cartesian product of two DStreams in Spark Streaming with Scala?

I have two DStreams. Let A:DStream[X] and B:DStream[Y]. I want to get the cartesian product of them, in other words, a new C:DStream[(X, Y)] containing all the pairs of X and Y values. I know there is a cartesian function for RDDs. I was only able…
Coukaratcha • 133 • 2 • 11
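
The usual answer is transformWith, which pairs up the two streams' RDDs batch by batch so RDD.cartesian can be applied. A minimal sketch, assuming both DStreams share the same StreamingContext and batch interval:

```scala
// A sketch: per-batch cartesian product via transformWith.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def cartesian[X: ClassTag, Y: ClassTag](a: DStream[X], b: DStream[Y]): DStream[(X, Y)] =
  a.transformWith(b, (ra: RDD[X], rb: RDD[Y]) => ra.cartesian(rb))
```
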
1 vote • 1 answer

Distinct Elements across DStreams

I am working on windowed DStreams wherein each DStream contains 3 RDDs with the following keys: a,b,c; b,c,d; c,d,e; d,e,f. I want to get only the unique keys across all DStreams: a,b,c,d,e,f. How to do it in Spark Streaming?
vkb • 458 • 1 • 7 • 18
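
One approach is to window the stream (which unions the RDDs falling in each window) and then de-duplicate per windowed batch via transform. A minimal sketch; the window and slide durations are illustrative:

```scala
// A sketch: union the RDDs in each window, then de-duplicate per windowed batch.
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def uniqueKeys(keys: DStream[String]): DStream[String] =
  keys.window(Seconds(30), Seconds(10)) // covers the RDDs of the last 30s
      .transform(_.distinct())          // one distinct() per window
```
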
1 vote • 1 answer

How to Combine two DStreams using PySpark (similar to .zip on normal RDDs)

I know that we can combine (like cbind in R) two RDDs as below in PySpark: rdd3 = rdd1.zip(rdd2). I want to perform the same for two DStreams in PySpark. Is it possible, or are there any alternatives? In fact, I am using an MLlib random forest model to predict…
Obaid • 237 • 2 • 14
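
The question is PySpark, but the idea carries over: transformWith lets you zip the corresponding batch RDDs. A Scala sketch, with the usual zip caveat that both RDDs need the same number of partitions and the same number of elements per partition:

```scala
// A sketch: zip corresponding microbatches of two DStreams.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def zipStreams[A: ClassTag, B: ClassTag](a: DStream[A], b: DStream[B]): DStream[(A, B)] =
  a.transformWith(b, (ra: RDD[A], rb: RDD[B]) => ra.zip(rb))
```
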
1 vote • 2 answers

How to solve Type mismatch issue (expected: Double, actual: Unit)

Here is my function that calculates the root mean squared error. However, the last line does not compile, with the error "Type mismatch (expected: Double, actual: Unit)". I tried many different ways to solve this issue, but still without…
Klue • 1,317 • 5 • 22 • 43
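
The usual cause of this error is that the function's last expression is a println (which returns Unit) rather than the computed Double. A minimal sketch of an RMSE function that returns its value; the body is illustrative, not the asker's code:

```scala
// A sketch: make the Double the last expression instead of a println.
import org.apache.spark.rdd.RDD

def rmse(predictionsAndLabels: RDD[(Double, Double)]): Double = {
  val mse = predictionsAndLabels
    .map { case (prediction, label) => val d = prediction - label; d * d }
    .mean()
  math.sqrt(mse) // return the value; ending with println(...) would yield Unit
}
```
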
1 vote • 1 answer

combineByKey on a DStream throws an error

I have a DStream with tuples (String, Int) in it. When I try combineByKey, it tells me to specify the parameter Partitioner: my_dstream.combineByKey( (v) => (v,1), (acc:(Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1:(Int, Int),…
Vadym B. • 681 • 7 • 21
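
Unlike RDD.combineByKey, the DStream variant has no overload that defaults the partitioner, so one must be passed explicitly. A minimal sketch using HashPartitioner; the partition count and the per-key-average use case are illustrative:

```scala
// A sketch: DStream.combineByKey requires an explicit Partitioner.
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

def averagePerKey(pairs: DStream[(String, Int)]): DStream[(String, Double)] =
  pairs.combineByKey[(Int, Int)](
      (v: Int) => (v, 1),                                          // createCombiner
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2), // mergeCombiners
      new HashPartitioner(4))
    .mapValues { case (sum, count) => sum.toDouble / count }
```
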
1 vote • 0 answers

Parallel reduceByKeyAndWindow()s with different time values

I am working on Spark Streaming on a use case which demands 4 different outputs computed on different window lengths. In particular, I need my program to output the result of the computation every second based on 4 different time windows (windows…
luke • 375 • 1 • 2 • 12
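
One way to express this is to derive all four windowed streams from the same source, giving each reduceByKeyAndWindow its own window length but a common one-second slide. A sketch, assuming a one-second batch interval (window and slide durations must be multiples of it); the window lengths are illustrative:

```scala
// A sketch: four window lengths over one source, all sliding every second.
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def windowedCounts(events: DStream[(String, Long)]): Seq[DStream[(String, Long)]] =
  Seq(5, 15, 30, 60).map { w => // window lengths in seconds (illustrative)
    events.reduceByKeyAndWindow(_ + _, Seconds(w), Seconds(1))
  }
```
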
1 vote • 3 answers

Spark Streaming not distributing tasks to nodes on cluster

I have a two-node standalone cluster for Spark stream processing. Below is my sample code, which demonstrates the process I am executing. sparkConf.setMaster("spark://rsplws224:7077") val ssc=new StreamingContext() println(ssc.sparkContext.master) val…
Jigar Parekh • 6,163 • 7 • 44 • 64
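
Two things worth checking in this situation: a master hardcoded via setMaster can shadow whatever spark-submit passes, and in a receiver-based job each receiver permanently occupies a core, so the application needs more cores than receivers for any batch to be processed. A configuration sketch; the values are illustrative:

```scala
// A configuration sketch; values are illustrative.
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .setAppName("StreamApp")
  // No setMaster here: pass --master spark://rsplws224:7077 to spark-submit
  // instead, so a hardcoded local master cannot shadow the cluster one.
  .set("spark.cores.max", "4") // keep this above the number of receivers
```
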
0 votes • 0 answers

How to use a DataFrame, which is created from a DStream, outside of the foreachRDD block?

I have been trying to work with Spark Streaming. My problem is that I want to use wordCountsDataFrame again outside of the foreachRDD block; I want to conditionally join wordCountsDataFrame with another DataFrame that is created from a DStream. Is there any…
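
One workaround is to publish the per-batch DataFrame from inside foreachRDD, either as a driver-side reference or as a temp view, so later code can join against it. A minimal sketch; the names and the @volatile holder are hypothetical:

```scala
// A sketch: expose the per-batch DataFrame via a temp view / driver reference.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.streaming.dstream.DStream

object CapturedCounts {
  @volatile var wordCountsDataFrame: Option[DataFrame] = None // driver-side handle

  def capture(words: DStream[String]): Unit =
    words.foreachRDD { rdd =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      val df = rdd.map((_, 1)).toDF("word", "count")
      df.createOrReplaceTempView("word_counts") // joinable via spark.table(...)
      wordCountsDataFrame = Some(df)
    }
}
```
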
0 votes • 2 answers

How to calculate average by category in PySpark Streaming?

I have CSV data coming as DStreams from traffic counters. A sample is as…
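
The grouped-average pattern for this (sketched in Scala for consistency with the other examples; the question itself is PySpark) is to reduce (sum, count) pairs per key and divide. The CSV layout assumed here, "category,value", is hypothetical:

```scala
// A sketch: per-batch mean per category from "category,value" lines.
import org.apache.spark.streaming.dstream.DStream

def averageByCategory(lines: DStream[String]): DStream[(String, Double)] =
  lines.map(_.split(","))
    .map(f => (f(0), (f(1).toDouble, 1)))              // (category, (value, 1))
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // (sum, count) per key
    .mapValues { case (sum, count) => sum / count }
```
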
0 votes • 1 answer

Read Avro records from Kafka using Spark DStreams

I'm using Spark 2.3 and trying to stream data from Kafka using DStreams (using DStreams to achieve a specific use case which we were not able to with Structured Streaming). The Kafka topic contains data in Avro format. I want to read that data…
BHC • 77 • 9
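
A minimal sketch, assuming the spark-streaming-kafka-0-10 integration and that the Avro writer schema is known to the consumer; the broker, topic, schema, and group id are hypothetical (with a schema registry, the decoding step would differ):

```scala
// A sketch: Kafka byte-array values decoded as Avro GenericRecords.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val ssc = new StreamingContext(new SparkConf().setAppName("AvroReader"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092", // hypothetical
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[ByteArrayDeserializer],
  "group.id" -> "avro-reader")

// Hypothetical writer schema; with a schema registry it would be fetched instead.
val schemaJson =
  """{"type":"record","name":"Reading","fields":[{"name":"value","type":"double"}]}"""

val stream = KafkaUtils.createDirectStream[String, Array[Byte]](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, Array[Byte]](Seq("readings"), kafkaParams))

// Parse the schema once per partition, then decode each Avro payload.
val decoded = stream.mapPartitions { records =>
  val reader =
    new GenericDatumReader[GenericRecord](new Schema.Parser().parse(schemaJson))
  records.map { r =>
    reader.read(null, DecoderFactory.get.binaryDecoder(r.value(), null)).toString
  }
}

decoded.print()
ssc.start()
ssc.awaitTermination()
```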