Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.
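As a rough illustration of the model, here is a minimal Spark Streaming sketch in Scala (the local socket source on port 9999 and the 10-second batch interval are illustrative choices) that counts words in each micro-batch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Each 10-second interval becomes one small, deterministic batch (an RDD).
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)   // deterministic parallel operation per interval

    counts.print()
    ssc.start()
    ssc.awaitTermination()

Each reduceByKey here operates on the dataset accumulated for one interval, which is what makes the computation a series of small batch jobs rather than a record-at-a-time pipeline.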

109 questions
0
votes
2 answers

Increase number of partitions in DStream to be greater than Kafka partitions in the Direct approach

There are 32 Kafka partitions and 32 consumers, as per the Direct approach. But data processing by the 32 consumers is slower than the Kafka rate (1.5x), which creates a backlog of data in Kafka. I want to increase the number of partitions for the DStream received…
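For the question above, a hedged sketch of the usual approach: with the direct Kafka stream the initial parallelism equals the number of Kafka partitions, but the DStream can be repartitioned before the expensive processing (ssc is an existing StreamingContext; the broker address, topic name, and factor of 64 are illustrative, not taken from the question):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // illustrative
    val topics = Set("events")                                        // illustrative

    // Direct stream: one Spark partition per Kafka partition (32 in the question).
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Shuffle each batch into more partitions so more cores work on it in parallel.
    val widened = stream.map(_._2).repartition(64)
    widened.foreachRDD { rdd =>
      // expensive per-record processing goes here
    }

Note that repartition adds a shuffle, so it only helps when the per-record work dominates the shuffle cost.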
0
votes
1 answer

Kafka on Spark only reads realtime ingestion

Spark version = 2.3.0 Kafka version = 1.0.0 Snippet of code being used: # Kafka Endpoints zkQuorum = '192.168.2.10:2181,192.168.2.12:2181' topic = 'Test_topic' # Create a Kafka Stream kafkaStream = KafkaUtils.createStream(ssc, zkQuorum,…
steven
  • 644
  • 1
  • 11
  • 23
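For the question above (which uses PySpark), one setting worth checking is the consumer's auto.offset.reset: with the receiver-based createStream, a consumer group with no committed offsets starts from the latest offsets by default, so data already sitting in the topic is never read. A hedged Scala sketch of passing it explicitly (the group id is illustrative; the ZooKeeper quorum and topic name are taken from the question):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> "192.168.2.10:2181,192.168.2.12:2181",
      "group.id"          -> "test-consumer-group",   // illustrative group id
      "auto.offset.reset" -> "smallest"               // start from the earliest available offsets
    )

    val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("Test_topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)

The setting only applies when the consumer group has no committed offsets yet, so testing with a fresh group.id is the easiest way to see its effect.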
0
votes
1 answer

Spark Streaming DStream map vs foreachRDD: which is more efficient for transformation?

Just for a transformation, map and foreachRDD can achieve the same goal, but which one is more efficient, and why? For example, for a DStream[Int]: val newDs1 = Ds.map(x => x + 1) val newDs2 = Ds.foreachRDD(rdd => rdd.map(x => x + 1)) I know foreachRDD will…
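For reference, a small sketch of the distinction (Ds is the hypothetical DStream[Int] from the question): map is a lazy transformation that yields a new DStream, while foreachRDD is an output operation returning Unit, so a bare rdd.map inside it never runs unless an action is called on its result:

    import org.apache.spark.streaming.dstream.DStream

    // map: a transformation; produces a new DStream[Int] lazily.
    val newDs1: DStream[Int] = Ds.map(_ + 1)
    newDs1.print()   // some output operation is still needed for anything to execute

    // foreachRDD: an output operation; returns Unit, not a DStream.
    Ds.foreachRDD { rdd =>
      val incremented = rdd.map(_ + 1)        // lazy on its own
      incremented.take(5).foreach(println)    // an RDD action is required here
    }

So for a plain transformation the map form is the natural choice; foreachRDD is meant for pushing each batch out to external systems.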
0
votes
1 answer

Unable to iterate over the list of keys retrieved from converting a DStream to a List while using Spark Streaming with Kafka

Below is the code for Spark Streaming with Kafka. Here I am trying to get the keys for the batch as a DStream and then convert it to a List, in order to iterate over it and put the data pertaining to each key into an HDFS folder named after the key. The key is…
Varun
  • 83
  • 1
  • 10
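For the question above, a hedged sketch of one common pattern: collect only the distinct keys of each batch to the driver inside foreachRDD, then filter and write per key (pairStream, assumed here to be a DStream[(String, String)] of key/record pairs, and the HDFS path are hypothetical placeholders):

    pairStream.foreachRDD { rdd =>
      // The distinct keys of a batch are usually a small set, safe to collect.
      val keys: List[String] = rdd.keys.distinct().collect().toList

      keys.foreach { key =>
        // Write the records belonging to this key under a folder named after it.
        rdd.filter { case (k, _) => k == key }
           .values
           .saveAsTextFile(s"hdfs:///output/$key/batch-${System.currentTimeMillis}")
      }
    }

If the number of keys is large, writing the whole batch once into key-partitioned output scales better than one filter-and-save pass per key.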
0
votes
0 answers

Spark Streaming in Standalone Cluster takes the same Kafka message more than once

My Spark Streaming application takes each record only once when I run it locally, but when I deploy it on a standalone cluster it reads the same message from Kafka twice. Also, I've double-checked that this is not a problem related to the…
ggagliano
  • 1,004
  • 1
  • 11
  • 27
0
votes
1 answer

Simulate RDD Dstream in PySpark from a series of offline events

I need to inject events saved to HDFS during online Kafka streaming back into a PySpark DStream to undergo the same algorithmic processing. I found a code example by Holden Karau that is "equivalent to a checkpointable, replayable, reliable message queue like…
Alex
  • 49
  • 4
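The question is about PySpark, but the standard trick is the same in both APIs: load the saved events, split them into per-batch RDDs, and feed them through queueStream so the rest of the pipeline sees an ordinary DStream. A Scala sketch (ssc is an existing StreamingContext; the HDFS path and the four-way split are illustrative):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // Load the events previously saved to HDFS.
    val saved: RDD[String] = ssc.sparkContext.textFile("hdfs:///events/replay")

    // One queued RDD is served per batch interval, simulating live input.
    val batches = mutable.Queue(saved.randomSplit(Array(0.25, 0.25, 0.25, 0.25)): _*)
    val replayed = ssc.queueStream(batches, oneAtATime = true)

    replayed.print()   // the same downstream algorithms can now consume `replayed`

PySpark's StreamingContext has an equivalent queueStream method, so the same structure carries over directly.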
0
votes
1 answer

Does Spark read data from Kafka partition into executor, for a batch which is queued?

During Spark Streaming with the streaming-kafka-0-8-integration Direct Approach, if the batches are getting queued, will the executors pull the data for the queued batches into their memory? If not, what is the harm in having a very long backlog of batches?
phoenix
  • 1
  • 1
  • 1
0
votes
1 answer

Unzipping zipped dstream packages

How do I unzip the .dstream.Z package on Sun Solaris? I tried all of the methods below. gunzip # gunzip pkg@1821417@27528.dstream.Z gzip: pkg@1821417@27528.dstream.Z: not in gzip format unzip # unzip pkg@1821417@27528.dstream.Z Archive: …
root
  • 15
  • 4
0
votes
1 answer

How to generate JavaPairInputDStream from JavaStreamingContext?

I am learning Apache Spark Streaming and tried to generate a JavaPairInputDStream from a JavaStreamingContext. Below is my code: import java.util.ArrayList; import java.util.Arrays; import java.util.LinkedList; import java.util.List; import…
Joseph Hwang
  • 1,337
  • 3
  • 38
  • 67
0
votes
1 answer

Spark Streaming with Python: Joining two stream with respect to a particular attribute

I am receiving two socket streams S1 and S2 with schemas S1 and S2, respectively. I would like to join S1 and S2 with respect to attribute "a" using Spark Streaming. Following is my code: sc = SparkContext("local[3]", "StreamJoin") ssc =…
shaikh
  • 582
  • 6
  • 24
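For the question above (written in Python), the general pattern is to key both streams by the join attribute and use the pair-DStream join, which joins the two RDDs of each batch interval. A Scala sketch (s1 and s2 stand for the two socket DStreams, and attribute "a" is assumed to be the first comma-separated field):

    // Key each stream by the join attribute.
    val s1Keyed = s1.map { line => val f = line.split(","); (f(0), f) }
    val s2Keyed = s2.map { line => val f = line.split(","); (f(0), f) }

    // join pairs up records of the two streams that share a key, per batch.
    val joined = s1Keyed.join(s2Keyed)
    joined.print()

Because the join happens batch by batch, records only match if they arrive within the same batch (or the same window, if the streams are windowed first).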
0
votes
0 answers

DStream[Double] to DStream in Scala

I am developing a Spark consumer application which consumes messages from a Kafka broker. I want to find the average of the messages coming to the Spark consumer, and finally I want to store that average in Cassandra. val Array(brokers,…
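For the question above, a rough sketch of computing a per-batch average (values is a hypothetical DStream[Double] parsed from the Kafka messages; the actual Cassandra write via the spark-cassandra-connector is only indicated in a comment):

    import org.apache.spark.streaming.dstream.DStream

    val avgPerBatch: DStream[Double] = values
      .map(v => (v, 1L))                                           // (sum, count) per record
      .reduce { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }  // combine within the batch
      .map { case (sum, n) => sum / n }

    avgPerBatch.foreachRDD { rdd =>
      // With the spark-cassandra-connector, rdd.saveToCassandra(...) would go here.
      rdd.collect().foreach(println)
    }

For an average over all data seen so far (rather than per batch), mapWithState or updateStateByKey would be needed to carry the running sum and count across batches.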
0
votes
1 answer

sortByKey is not working on DStream

I am using the transform API of DStream (Spark Streaming) to sort the data. I am reading from a TCP socket using netcat. Following is the line of code used: myDStream.transform(rdd => rdd.sortByKey()) It is unable to find the function sortByKey. Could anyone please…
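A hedged sketch of what usually resolves this: sortByKey is only defined for RDDs of key/value pairs (it comes from OrderedRDDFunctions), so the raw socket lines have to be mapped into tuples first; on older Spark versions the pair-RDD implicits also had to be imported explicitly:

    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on pre-1.3 Spark)

    val lines = ssc.socketTextStream("localhost", 9999)

    // Turn each line into a (key, value) pair so that sortByKey exists on the RDD.
    val pairs = lines.map { line =>
      val fields = line.split(" ")
      (fields(0), line)
    }

    // transform exposes the RDD of each batch, where sortByKey can be applied.
    val sorted = pairs.transform(rdd => rdd.sortByKey())
    sorted.print()

The key type also needs an Ordering in scope (String, Int, and the other usual key types have one by default).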
0
votes
0 answers

What's the difference between foreachRDD and transform in spark streaming?

To use RDD operations we can use either foreachRDD() or transform(), but I cannot understand the difference between them.
Slim AZAIZ
  • 646
  • 1
  • 8
  • 21
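As a short sketch of the difference (lines is a hypothetical DStream[String]): transform is a transformation that builds a new DStream from an RDD-to-RDD function and stays lazy, while foreachRDD is an output operation that returns Unit and is where side effects such as external writes belong:

    import org.apache.spark.streaming.dstream.DStream

    // transform: RDD => RDD per batch; the result is still a DStream and still lazy.
    val upper: DStream[String] = lines.transform(rdd => rdd.map(_.toUpperCase))
    upper.print()   // an output operation is still required downstream

    // foreachRDD: an output operation; returns Unit and triggers a job for each batch.
    lines.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach(println)   // e.g. write each partition to an external store here
      }
    }

Rule of thumb: use transform when the result should continue through the streaming pipeline, and foreachRDD when the batch is being pushed out of it.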
0
votes
1 answer

Convert DStream to DataFrame using PySpark

How can I convert a DStream to a DataFrame? Here is my actual code: localhost = "127.0.0.1" addresses = [(localhost, 9999)] schema = ['event', 'id', 'time', 'occurence'] flumeStream = FlumeUtils.createPollingStream(ssc, addresses) counts =…
Imane Jabal
  • 35
  • 1
  • 9
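The question above is in PySpark; the usual pattern in either API is to convert each batch's RDD inside foreachRDD. A Scala sketch using the question's field names (stream is a hypothetical DStream[String] of comma-separated records):

    import org.apache.spark.sql.SparkSession

    case class Event(event: String, id: String, time: String, occurence: String)

    stream.foreachRDD { rdd =>
      val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      // Turn the raw lines of this batch into a typed DataFrame.
      val df = rdd.map { line =>
        val f = line.split(",")
        Event(f(0), f(1), f(2), f(3))
      }.toDF()

      df.show()
    }

In PySpark the shape is the same: inside foreachRDD, build rows from the RDD and call spark.createDataFrame(rows, schema).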
0
votes
1 answer

Are DStream map and DStream transform map the same in Spark?

Are the following two the same? val dstream = stream.window(Seconds(60), Seconds(1)) val x = dstream.map(x => ...) and val dstream = stream.window(Seconds(60), Seconds(1)) val x = dstream.transform(rdd => rdd.map(x => ...))
pythonic
  • 20,589
  • 43
  • 136
  • 219