Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.
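As a rough illustration of the model, here is a minimal Spark Streaming sketch in Scala (the local socket source on port 9999 and the 10-second batch interval are illustrative choices) that counts words in each micro-batch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Each 10-second interval becomes one small, deterministic batch (an RDD).
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)   // deterministic parallel operation per interval

    counts.print()
    ssc.start()
    ssc.awaitTermination()

Each reduceByKey here operates on the dataset accumulated for one interval, which is what makes the computation a series of small batch jobs rather than a record-at-a-time pipeline.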

109 questions
0
votes
2 answers

Increase number of partitions in DStream to be greater than Kafka partitions in the Direct approach

There are 32 Kafka partitions and 32 consumers, as per the Direct approach. But data processing by the 32 consumers is slower than the Kafka rate (1.5x), which creates a backlog of data in Kafka. I want to increase the number of partitions for the DStream received…
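For the question above, a hedged sketch of the usual approach: with the direct Kafka stream the initial parallelism equals the number of Kafka partitions, but the DStream can be repartitioned before the expensive processing (ssc is an existing StreamingContext; the broker address, topic name, and factor of 64 are illustrative, not taken from the question):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // illustrative
    val topics = Set("events")                                        // illustrative

    // Direct stream: one Spark partition per Kafka partition (32 in the question).
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Shuffle each batch into more partitions so more cores work on it in parallel.
    val widened = stream.map(_._2).repartition(64)
    widened.foreachRDD { rdd =>
      // expensive per-record processing goes here
    }

Note that repartition adds a shuffle, so it only helps when the per-record work dominates the shuffle cost.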
0
votes
1 answer

Kafka on Spark only reads realtime ingestion

Spark version = 2.3.0 Kafka version = 1.0.0 Snippet of code being used: # Kafka Endpoints zkQuorum = '192.168.2.10:2181,192.168.2.12:2181' topic = 'Test_topic' # Create a Kafka Stream kafkaStream = KafkaUtils.createStream(ssc, zkQuorum,…
steven
  • 644
  • 1
  • 11
  • 23
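For the question above (which uses PySpark), one setting worth checking is the consumer's auto.offset.reset: with the receiver-based createStream, a consumer group with no committed offsets starts from the latest offsets by default, so data already sitting in the topic is never read. A hedged Scala sketch of passing it explicitly (the group id is illustrative; the ZooKeeper quorum and topic name are taken from the question):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> "192.168.2.10:2181,192.168.2.12:2181",
      "group.id"          -> "test-consumer-group",   // illustrative group id
      "auto.offset.reset" -> "smallest"               // start from the earliest available offsets
    )

    val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("Test_topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)

The setting only applies when the consumer group has no committed offsets yet, so testing with a fresh group.id is the easiest way to see its effect.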
0
votes
1 answer

Spark Streaming DStream map vs foreachRDD: which is more efficient for transformation?

Just for a transformation, map and foreachRDD can achieve the same goal, but which one is more efficient, and why? For example, for a DStream[Int]: val newDs1 = Ds.map(x => x + 1) val newDs2 = Ds.foreachRDD(rdd => rdd.map(x => x + 1)) I know foreachRDD will…
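For reference, a small sketch of the distinction (Ds is the hypothetical DStream[Int] from the question): map is a lazy transformation that yields a new DStream, while foreachRDD is an output operation returning Unit, so a bare rdd.map inside it never runs unless an action is called on its result:

    import org.apache.spark.streaming.dstream.DStream

    // map: a transformation; produces a new DStream[Int] lazily.
    val newDs1: DStream[Int] = Ds.map(_ + 1)
    newDs1.print()   // some output operation is still needed for anything to execute

    // foreachRDD: an output operation; returns Unit, not a DStream.
    Ds.foreachRDD { rdd =>
      val incremented = rdd.map(_ + 1)        // lazy on its own
      incremented.take(5).foreach(println)    // an RDD action is required here
    }

So for a plain transformation the map form is the natural choice; foreachRDD is meant for pushing each batch out to external systems.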
0
votes
1 answer

Unable to iterate over the list of keys retrieved from converting a DStream to a List while using Spark Streaming with Kafka

Below is the code for Spark Streaming with Kafka. Here I am trying to get the keys for the batch as a DStream and then convert it to a List, in order to iterate over it and put the data pertaining to each key into an HDFS folder named after the key. The key is…
Varun
  • 83
  • 1
  • 10
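For the question above, a hedged sketch of one common pattern: collect only the distinct keys of each batch to the driver inside foreachRDD, then filter and write per key (pairStream, assumed here to be a DStream[(String, String)] of key/record pairs, and the HDFS path are hypothetical placeholders):

    pairStream.foreachRDD { rdd =>
      // The distinct keys of a batch are usually a small set, safe to collect.
      val keys: List[String] = rdd.keys.distinct().collect().toList

      keys.foreach { key =>
        // Write the records belonging to this key under a folder named after it.
        rdd.filter { case (k, _) => k == key }
           .values
           .saveAsTextFile(s"hdfs:///output/$key/batch-${System.currentTimeMillis}")
      }
    }

If the number of keys is large, writing the whole batch once into key-partitioned output scales better than one filter-and-save pass per key.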
0
votes
0 answers

Spark Streaming in Standalone Cluster takes the same Kafka message more than once

My Spark Streaming application takes each record only once when I run it locally, but when I deploy it on a standalone cluster it reads the same message from Kafka twice. Also, I've double-checked that this is not a problem related to the…
ggagliano
  • 1,004
  • 1
  • 11
  • 27
0
votes
1 answer

Simulate RDD Dstream in PySpark from a series of offline events

I need to inject events saved to HDFS during online Kafka streaming back into a PySpark DStream to undergo the same algorithmic processing. I found a code example by Holden Karau that is "equivalent to a checkpointable, replayable, reliable message queue like…
Alex
  • 49
  • 4
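The question is about PySpark, but the standard trick is the same in both APIs: load the saved events, split them into per-batch RDDs, and feed them through queueStream so the rest of the pipeline sees an ordinary DStream. A Scala sketch (ssc is an existing StreamingContext; the HDFS path and the four-way split are illustrative):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // Load the events previously saved to HDFS.
    val saved: RDD[String] = ssc.sparkContext.textFile("hdfs:///events/replay")

    // One queued RDD is served per batch interval, simulating live input.
    val batches = mutable.Queue(saved.randomSplit(Array(0.25, 0.25, 0.25, 0.25)): _*)
    val replayed = ssc.queueStream(batches, oneAtATime = true)

    replayed.print()   // the same downstream algorithms can now consume `replayed`

PySpark's StreamingContext has an equivalent queueStream method, so the same structure carries over directly.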
0
votes
1 answer

Does Spark read data from Kafka partition into executor, for a batch which is queued?

During Spark Streaming with the streaming-kafka-0-8-integration Direct Approach, if the batches are getting queued, will the executors pull the data for the queued batches into their memory? If not, what is the harm in having a very long backlog of batches?
phoenix
  • 1
  • 1
  • 1
0
votes
1 answer

Unzipping zipped dstream packages

How do I unzip the .dstream.Z package on Sun Solaris? I tried all of the methods below. gunzip # gunzip pkg@1821417@27528.dstream.Z gzip: pkg@1821417@27528.dstream.Z: not in gzip format unzip # unzip pkg@1821417@27528.dstream.Z Archive: …
root
  • 15
  • 4
0
votes
1 answer

How to generate JavaPairInputDStream from JavaStreamingContext?

I am learning Apache Spark Streaming and tried to generate a JavaPairInputDStream from a JavaStreamingContext. Below is my code: import java.util.ArrayList; import java.util.Arrays; import java.util.LinkedList; import java.util.List; import…
Joseph Hwang
  • 1,337
  • 3
  • 38
  • 67
0
votes
1 answer

Spark Streaming with Python: Joining two stream with respect to a particular attribute

I am receiving two socket streams S1 and S2 with schemas S1 and S2, respectively. I would like to join S1 and S2 with respect to attribute "a" using Spark Streaming. Following is my code: sc = SparkContext("local[3]", "StreamJoin") ssc =…
shaikh
  • 582
  • 6
  • 24
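For the question above (written in Python), the general pattern is to key both streams by the join attribute and use the pair-DStream join, which joins the two RDDs of each batch interval. A Scala sketch (s1 and s2 stand for the two socket DStreams, and attribute "a" is assumed to be the first comma-separated field):

    // Key each stream by the join attribute.
    val s1Keyed = s1.map { line => val f = line.split(","); (f(0), f) }
    val s2Keyed = s2.map { line => val f = line.split(","); (f(0), f) }

    // join pairs up records of the two streams that share a key, per batch.
    val joined = s1Keyed.join(s2Keyed)
    joined.print()

Because the join happens batch by batch, records only match if they arrive within the same batch (or the same window, if the streams are windowed first).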
0
votes
0 answers

DStream[Double] to DStream in Scala

I am developing a Spark consumer application which consumes messages from a Kafka broker. I want to find the average of the messages coming to the Spark consumer, and finally I want to store that average in Cassandra. val Array(brokers,…
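For the question above, a rough sketch of computing a per-batch average (values is a hypothetical DStream[Double] parsed from the Kafka messages; the actual Cassandra write via the spark-cassandra-connector is only indicated in a comment):

    import org.apache.spark.streaming.dstream.DStream

    val avgPerBatch: DStream[Double] = values
      .map(v => (v, 1L))                                           // (sum, count) per record
      .reduce { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }  // combine within the batch
      .map { case (sum, n) => sum / n }

    avgPerBatch.foreachRDD { rdd =>
      // With the spark-cassandra-connector, rdd.saveToCassandra(...) would go here.
      rdd.collect().foreach(println)
    }

For an average over all data seen so far (rather than per batch), mapWithState or updateStateByKey would be needed to carry the running sum and count across batches.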
0
votes
1 answer

sortByKey is not working on DStream

I am using the transform API of DStream (Spark Streaming) to sort the data. I am reading from a TCP socket using netcat. Following is the line of code used: myDStream.transform(rdd => rdd.sortByKey()) It is unable to find the function sortByKey. Could anyone please…
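A hedged sketch of what usually resolves this: sortByKey is only defined for RDDs of key/value pairs (it comes from OrderedRDDFunctions), so the raw socket lines have to be mapped into tuples first; on older Spark versions the pair-RDD implicits also had to be imported explicitly:

    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on pre-1.3 Spark)

    val lines = ssc.socketTextStream("localhost", 9999)

    // Turn each line into a (key, value) pair so that sortByKey exists on the RDD.
    val pairs = lines.map { line =>
      val fields = line.split(" ")
      (fields(0), line)
    }

    // transform exposes the RDD of each batch, where sortByKey can be applied.
    val sorted = pairs.transform(rdd => rdd.sortByKey())
    sorted.print()

The key type also needs an Ordering in scope (String, Int, and the other usual key types have one by default).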
0
votes
0 answers

What's the difference between foreachRDD and transform in spark streaming?

To use RDD operations we can use either foreachRDD() or transform(), but I cannot understand the difference between them.
Slim AZAIZ
  • 646
  • 1
  • 8
  • 21
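As a short sketch of the difference (lines is a hypothetical DStream[String]): transform is a transformation that builds a new DStream from an RDD-to-RDD function and stays lazy, while foreachRDD is an output operation that returns Unit and is where side effects such as external writes belong:

    import org.apache.spark.streaming.dstream.DStream

    // transform: RDD => RDD per batch; the result is still a DStream and still lazy.
    val upper: DStream[String] = lines.transform(rdd => rdd.map(_.toUpperCase))
    upper.print()   // an output operation is still required downstream

    // foreachRDD: an output operation; returns Unit and triggers a job for each batch.
    lines.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach(println)   // e.g. write each partition to an external store here
      }
    }

Rule of thumb: use transform when the result should continue through the streaming pipeline, and foreachRDD when the batch is being pushed out of it.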
0
votes
1 answer

Convert DStream to DataFrame using PySpark

How can I convert a DStream to a DataFrame? Here is my actual code: localhost = "127.0.0.1" addresses = [(localhost, 9999)] schema = ['event', 'id', 'time', 'occurence'] flumeStream = FlumeUtils.createPollingStream(ssc, addresses) counts =…
Imane Jabal
  • 35
  • 1
  • 9
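The question above is in PySpark; the usual pattern in either API is to convert each batch's RDD inside foreachRDD. A Scala sketch using the question's field names (stream is a hypothetical DStream[String] of comma-separated records):

    import org.apache.spark.sql.SparkSession

    case class Event(event: String, id: String, time: String, occurence: String)

    stream.foreachRDD { rdd =>
      val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      // Turn the raw lines of this batch into a typed DataFrame.
      val df = rdd.map { line =>
        val f = line.split(",")
        Event(f(0), f(1), f(2), f(3))
      }.toDF()

      df.show()
    }

In PySpark the shape is the same: inside foreachRDD, build rows from the RDD and call spark.createDataFrame(rows, schema).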
0
votes
1 answer

Are DStream map and DStream transform map the same in Spark?

Are the following two the same? val dstream = stream.window(Seconds(60), Seconds(1)) val x = dstream.map(x => ...) and val dstream = stream.window(Seconds(60), Seconds(1)) val x = dstream.transform(rdd => rdd.map(x => ...))
pythonic
  • 20,589
  • 43
  • 136
  • 219