Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.
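
A minimal word-count sketch of this model, assuming a local socket source on port 9999: everything received during each 10-second interval becomes one batch, processed with deterministic operations such as map and reduceByKey.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Each 10-second interval of input becomes one batch of the DStream.
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-demo")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999) // assumed input source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()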

109 questions
3
votes
1 answer

How to create a DStream from a List of strings?

I have a list of strings, but I can't find a way to turn the list into a Spark Streaming DStream. I tried this: val tmpList = List("hi", "hello") val rdd = sqlContext.sparkContext.parallelize(Seq(tmpList)) val rowRdd = rdd.map(v => Row(v:…
pauly
  • 51
  • 3
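
One way to build a DStream from an in-memory list is queueStream, which turns a queue of RDDs into a stream; a sketch (queueStream is intended for testing, not production):

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import scala.collection.mutable

    // Sketch: wrap the list in an RDD and feed it through an RDD queue.
    val conf = new SparkConf().setMaster("local[2]").setAppName("list-to-dstream")
    val ssc = new StreamingContext(conf, Seconds(1))
    val tmpList = List("hi", "hello")
    val rddQueue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(tmpList))
    val dstream = ssc.queueStream(rddQueue)
    dstream.print()
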
3
votes
1 answer

Sorting a DStream and taking topN

I have a DStream in Spark Scala and I want to sort it and then take the top N. The problem is that whenever I try to run it I get a NotSerializableException, and the exception message says: This is because the DStream object is being referred to from…
Ahmed El-Gamal
  • 180
  • 3
  • 18
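
That exception usually means the DStream itself is captured in a closure; a common workaround is to do the per-batch work inside transform, so closures only touch RDD-level data. A sketch, assuming a DStream[(String, Int)] named counts and an assumed topN:

    // Sketch: sort each micro-batch inside transform; the (non-serializable)
    // DStream is never referenced from inside the closure.
    val topN = 10 // assumed
    val topOfBatch = counts.transform { rdd =>
      rdd.sortBy(_._2, ascending = false)
         .zipWithIndex()
         .filter { case (_, idx) => idx < topN }
         .map { case (pair, _) => pair }
    }
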
3
votes
1 answer

How can I return two DStreams in a function after using the filter transformation in Spark Streaming?

In a function, is there a way to return two DStreams after using filter? For example, when I filter a DStream, the elements that pass the filter would be stored in one DStream and the filtered-out ones in another DStream.
Ronald Segan
  • 215
  • 2
  • 11
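
filter keeps only matching elements, so one sketch is to apply the predicate twice, once negated, and return both streams as a pair (note the stream's lineage is evaluated twice):

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: split a DStream into (matching, non-matching) halves.
    def split[T](stream: DStream[T])(p: T => Boolean): (DStream[T], DStream[T]) =
      (stream.filter(p), stream.filter(x => !p(x)))
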
2
votes
1 answer

Read nested JSON data in a DStream in PySpark

I have written the following code to stream data from the Tweepy API, and I am getting data inside the stream object, but I am unable to get stream["user"]["followers_count"] and don't know how to extract it. I also tried jsonLines = lines.flatMap(lambda…
Prince Kumar Sharma
  • 12,591
  • 4
  • 59
  • 90
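
The question is PySpark, where the analogue is json.loads per record; for consistency with the other sketches, the same idea in Scala, assuming a DStream[String] of raw tweet JSON named lines and json4s on the classpath:

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    // Sketch: parse each record, then walk the nested structure.
    implicit val formats: Formats = DefaultFormats
    val followerCounts = lines.map { record =>
      (parse(record) \ "user" \ "followers_count").extract[Long]
    }
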
2
votes
2 answers

Spark Streaming: tuning the number of records per batch not working?

My Spark Streaming app is reading from Kafka using the DStream approach, and I'm trying to get the batch size to process 60,000 messages in 10 seconds. What I've done: created a topic with 3 partitions, spark.streaming.kafka.maxRatePerPartition =…
alex
  • 1,905
  • 26
  • 51
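
For reference, spark.streaming.kafka.maxRatePerPartition caps records per second per partition, so the target works out as below (a sketch assuming the question's 3 partitions and 10-second batches):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 60,000 records / 10 s batch / 3 partitions = 2,000 records/s per partition.
    val conf = new SparkConf().setMaster("local[3]").setAppName("rate-tuning")
      .set("spark.streaming.kafka.maxRatePerPartition", "2000")
    val ssc = new StreamingContext(conf, Seconds(10))
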
2
votes
2 answers

Sorting a JavaDStream - Spark Streaming

I have an application which works with JavaDStream objects. This is a piece of code where I compute the frequencies with which words appear: JavaPairDStream wordCounts = words.mapToPair(new PairFunction
sirdan
  • 1,018
  • 2
  • 13
  • 34
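
The question uses the Java API; in Scala (used for the other sketches here) the per-batch sort can be written with transform, assuming wordCounts: DStream[(String, Long)]:

    // Sketch: swap to (count, word), sort by key descending, swap back.
    val sorted = wordCounts.transform(
      _.map(_.swap).sortByKey(ascending = false).map(_.swap)
    )
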
2
votes
1 answer

Kafka topics to Spark Streaming DStream: how to get the JSON

I'm trying to get the information from a Kafka topic with Spark Streaming and then parse the JSON I get in the topic. In order to get the topic into a DStream I use a stringReader, and then I use foreach to get every RDD from the…
jsonH
  • 61
  • 6
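
A common shape for this, sketched with the Spark 1.x Kafka 0.8 API to match the question (ssc, kafkaParams, and topics assumed), is to take each record's value and parse it per record:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Sketch: each Kafka record is a (key, value) pair; the JSON is the value.
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    val jsonStrings = messages.map(_._2) // parse each string as in the JSON sketch above
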
2
votes
1 answer

Invoking an external utility inside a Spark Streaming job

I have a streaming job consuming from Kafka (using createDstream). It is a stream of ids: [id1, id2, id3, …]. I have a utility or an API which accepts an array of ids, makes an external call, and receives back some info, say "t", for each id…
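
Since the utility accepts an array of ids, one sketch is to batch the external call per partition rather than per record; ids: DStream[String] is assumed, and externalLookup stands in for the question's utility:

    // Sketch: one external call per partition instead of one per id.
    val enriched = ids.mapPartitions { iter =>
      val batch = iter.toArray
      val info = externalLookup(batch) // hypothetical: Array[String] => Map[String, String]
      batch.iterator.map(id => (id, info.getOrElse(id, "")))
    }
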
2
votes
1 answer

Best solution to accumulate Spark Streaming DStream

I'm looking for the best solution to accumulate the last N messages in a Spark DStream. I'd also like to specify the number of messages to retain. For example, given the following stream, I'd like to retain the last 3 elements: Iteration…
user278530
  • 83
  • 2
  • 11
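
Windows are time-based rather than count-based, so one sketch for a count-based buffer is updateStateByKey with a single synthetic key (assumes stream: DStream[String] and a checkpoint directory set via ssc.checkpoint):

    // Sketch: keep only the most recent N elements across batches.
    val N = 3
    val lastN = stream.map(x => ("buffer", x))
      .updateStateByKey[List[String]] { (newValues: Seq[String], state: Option[List[String]]) =>
        Some((state.getOrElse(Nil) ++ newValues).takeRight(N))
      }
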
2
votes
1 answer

Perform actions before the end of the micro-batch in Spark Streaming

Is there a way to perform some action at the end of each micro-batch of a DStream in Spark Streaming? My aim is to compute the number of events processed by Spark. Spark Streaming gives me some numbers, but the average also seems to…
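
foreachRDD runs once per micro-batch on the driver, which gives a per-batch hook; a sketch, assuming stream: DStream[String]:

    // Sketch: this block executes on the driver once per micro-batch.
    var totalEvents = 0L
    stream.foreachRDD { rdd =>
      val n = rdd.count() // forces the batch to be computed
      totalEvents += n
      println(s"batch done: $n events, $totalEvents total")
    }
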
2
votes
1 answer

Kafka directStream DStream map does not print

I have this simple Kafka stream: val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet) // Each Kafka message is a flight val flights = messages.map(_._2) flights.foreachRDD( rdd =>…
Sudheer Palyam
  • 2,499
  • 2
  • 23
  • 28
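
A frequent cause is laziness: DStream transformations execute only after an output operation is registered and the context is started. A sketch reusing flights and ssc from the question:

    // Sketch: register an output operation, then start the context.
    flights.print()        // output operation; map alone triggers nothing
    ssc.start()            // nothing runs before this
    ssc.awaitTermination()
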
2
votes
0 answers

How many RDDs are in the resulting DStream of reduceByKeyAndWindow?

I am currently working on a small Spark job to compute a stock correlation matrix from a DStream. From a DStream[(time, quote)], I need to aggregate quotes (Double) by time (Long) across multiple RDDs before computing correlations (considering all…
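
For reference, a windowed stream still yields one RDD per slide interval, and each RDD covers the whole window; a sketch assuming quotes: DStream[(Long, Double)]:

    import org.apache.spark.streaming.Seconds

    // Sketch: one RDD per 10 s slide, each aggregating the last 60 s of quotes.
    val windowed = quotes.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
    windowed.foreachRDD(rdd => println(s"rows in this window's RDD: ${rdd.count()}"))
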
2
votes
2 answers

Iterative algorithms with Spark streaming

So I understand that Spark can perform iterative algorithms on single RDDs, for example logistic regression: val points = spark.textFile(...).map(parsePoint).cache() var w = Vector.random(D) // current separating plane for (i <- 1 to…
user1893354
  • 5,778
  • 12
  • 46
  • 83
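
MLlib ships streaming variants of some iterative algorithms that refit the model on each micro-batch; a sketch with streaming logistic regression, assuming trainingData: DStream[LabeledPoint] and a numFeatures value:

    import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    // Sketch: the model's weights are updated as each micro-batch arrives.
    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures)) // numFeatures assumed
    model.trainOn(trainingData)
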
1
vote
0 answers

Lambda function to group by first substring?

I am trying to write a lambda function that groups words based on their first substring. The words are coming in like: a,word b,can a,eat c,vegetables b,if So far I have a lambda function combineddatardd.combineByKey(lambda v: [v], lambda…
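
The question is PySpark; in Scala (used for the other sketches) the grouping can be sketched by splitting on the first comma and grouping by the leading key, assuming lines: DStream[String]:

    // Sketch: "a,word" -> ("a", "word"), then group values under each key.
    val grouped = lines
      .map { line => val Array(k, v) = line.split(",", 2); (k, v) }
      .groupByKey()
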
1
vote
2 answers

Get Max & Min values for each key in the RDD

spark = SparkSession.builder.getOrCreate() sc = spark.sparkContext ssc = StreamingContext(sc, 10) rdd = ssc.sparkContext.parallelize(pd_binance) rdd.take(1) Here is a small portion of the result: [['0.02703300', '1.30900000'], ['0.02703300',…
Saif Nitham
  • 11
  • 1
  • 2
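
The question is PySpark; the same idea in Scala (a sketch assuming pairs: RDD[(String, Double)] parsed from the data above) tracks (min, max) per key in a single reduceByKey pass:

    // Sketch: seed each value as (v, v), then merge ranges per key.
    val minMax = pairs
      .mapValues(v => (v, v))
      .reduceByKey { case ((lo1, hi1), (lo2, hi2)) =>
        (math.min(lo1, lo2), math.max(hi1, hi2))
      }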