Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.
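
A minimal word-count sketch of this model, assuming a local socket source on port 9999: everything received during each 10-second interval becomes one batch, processed with deterministic operations such as map and reduceByKey.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Each 10-second interval of input becomes one batch of the DStream.
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-demo")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999) // assumed input source
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()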

109 questions
3
votes
1 answer

How to create a DStream from a List of strings?

I have a list of strings, but I can't find a way to turn the list into a Spark Streaming DStream. I tried this: val tmpList = List("hi", "hello") val rdd = sqlContext.sparkContext.parallelize(Seq(tmpList)) val rowRdd = rdd.map(v => Row(v:…
pauly
  • 51
  • 3
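
One way to build a DStream from an in-memory list is queueStream, which turns a queue of RDDs into a stream; a sketch (queueStream is intended for testing, not production):

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import scala.collection.mutable

    // Sketch: wrap the list in an RDD and feed it through an RDD queue.
    val conf = new SparkConf().setMaster("local[2]").setAppName("list-to-dstream")
    val ssc = new StreamingContext(conf, Seconds(1))
    val tmpList = List("hi", "hello")
    val rddQueue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(tmpList))
    val dstream = ssc.queueStream(rddQueue)
    dstream.print()
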
3
votes
1 answer

Sorting a DStream and taking topN

I have a DStream in Spark Scala and I want to sort it and then take the top N. The problem is that whenever I try to run it I get a NotSerializableException, and the exception message says: This is because the DStream object is being referred to from…
Ahmed El-Gamal
  • 180
  • 3
  • 18
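
That exception usually means the DStream itself is captured in a closure; a common workaround is to do the per-batch work inside transform, so closures only touch RDD-level data. A sketch, assuming a DStream[(String, Int)] named counts and an assumed topN:

    // Sketch: sort each micro-batch inside transform; the (non-serializable)
    // DStream is never referenced from inside the closure.
    val topN = 10 // assumed
    val topOfBatch = counts.transform { rdd =>
      rdd.sortBy(_._2, ascending = false)
         .zipWithIndex()
         .filter { case (_, idx) => idx < topN }
         .map { case (pair, _) => pair }
    }
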
3
votes
1 answer

How can I return two DStreams in a function after using the filter transformation in Spark Streaming?

In a function, is there a way to return two DStreams after using filter? For example, when I filter a DStream, the elements that pass the filter would be stored in one DStream and the filtered-out ones in another DStream.
Ronald Segan
  • 215
  • 2
  • 11
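
filter keeps only matching elements, so one sketch is to apply the predicate twice, once negated, and return both streams as a pair (note the stream's lineage is evaluated twice):

    import org.apache.spark.streaming.dstream.DStream

    // Sketch: split a DStream into (matching, non-matching) halves.
    def split[T](stream: DStream[T])(p: T => Boolean): (DStream[T], DStream[T]) =
      (stream.filter(p), stream.filter(x => !p(x)))
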
2
votes
1 answer

Read nested JSON data in a DStream in PySpark

I have written the following code to stream data from the Tweepy API, and I am getting data inside the stream object, but I am unable to get stream["user"]["followers_count"] and don't know how to extract it. I also tried jsonLines = lines.flatMap(lambda…
Prince Kumar Sharma
  • 12,591
  • 4
  • 59
  • 90
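
The question is PySpark, where the analogue is json.loads per record; for consistency with the other sketches, the same idea in Scala, assuming a DStream[String] of raw tweet JSON named lines and json4s on the classpath:

    import org.json4s._
    import org.json4s.jackson.JsonMethods.parse

    // Sketch: parse each record, then walk the nested structure.
    implicit val formats: Formats = DefaultFormats
    val followerCounts = lines.map { record =>
      (parse(record) \ "user" \ "followers_count").extract[Long]
    }
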
2
votes
2 answers

Spark Streaming: tuning the number of records per batch not working?

My Spark Streaming app is reading from Kafka using the DStream approach, and I'm trying to get the batch size to process 60,000 messages in 10 seconds. What I've done: created a topic with 3 partitions, spark.streaming.kafka.maxRatePerPartition =…
alex
  • 1,905
  • 26
  • 51
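
For reference, spark.streaming.kafka.maxRatePerPartition caps records per second per partition, so the target works out as below (a sketch assuming the question's 3 partitions and 10-second batches):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 60,000 records / 10 s batch / 3 partitions = 2,000 records/s per partition.
    val conf = new SparkConf().setMaster("local[3]").setAppName("rate-tuning")
      .set("spark.streaming.kafka.maxRatePerPartition", "2000")
    val ssc = new StreamingContext(conf, Seconds(10))
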
2
votes
2 answers

Sorting a JavaDStream - Spark Streaming

I have an application which works with JavaDStream objects. This is a piece of code where I compute the frequencies with which words appear: JavaPairDStream wordCounts = words.mapToPair(new PairFunction
sirdan
  • 1,018
  • 2
  • 13
  • 34
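
The question uses the Java API; in Scala (used for the other sketches here) the per-batch sort can be written with transform, assuming wordCounts: DStream[(String, Long)]:

    // Sketch: swap to (count, word), sort by key descending, swap back.
    val sorted = wordCounts.transform(
      _.map(_.swap).sortByKey(ascending = false).map(_.swap)
    )
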
2
votes
1 answer

Kafka topics to Spark Streaming DStream: how to get the JSON

I'm trying to get the information from a Kafka topic with Spark Streaming and then parse the JSON I get in the topic. In order to get the topic into a DStream I use a stringReader, and then I use foreach to get every RDD from the…
jsonH
  • 61
  • 6
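
A common shape for this, sketched with the Spark 1.x Kafka 0.8 API to match the question (ssc, kafkaParams, and topics assumed), is to take each record's value and parse it per record:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Sketch: each Kafka record is a (key, value) pair; the JSON is the value.
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    val jsonStrings = messages.map(_._2) // parse each string as in the JSON sketch above
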
2
votes
1 answer

Invoking an external utility inside a Spark Streaming job

I have a streaming job consuming from Kafka (using createDstream). It is a stream of ids: [id1, id2, id3, …]. I have a utility or an API which accepts an array of ids, makes an external call, and receives back some info, say "t", for each id…
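
Since the utility accepts an array of ids, one sketch is to batch the external call per partition rather than per record; ids: DStream[String] is assumed, and externalLookup stands in for the question's utility:

    // Sketch: one external call per partition instead of one per id.
    val enriched = ids.mapPartitions { iter =>
      val batch = iter.toArray
      val info = externalLookup(batch) // hypothetical: Array[String] => Map[String, String]
      batch.iterator.map(id => (id, info.getOrElse(id, "")))
    }
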
2
votes
1 answer

Best solution to accumulate Spark Streaming DStream

I'm looking for the best solution to accumulate the last N messages in a Spark DStream. I'd also like to specify the number of messages to retain. For example, given the following stream, I'd like to retain the last 3 elements: Iteration…
user278530
  • 83
  • 2
  • 11
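
Windows are time-based rather than count-based, so one sketch for a count-based buffer is updateStateByKey with a single synthetic key (assumes stream: DStream[String] and a checkpoint directory set via ssc.checkpoint):

    // Sketch: keep only the most recent N elements across batches.
    val N = 3
    val lastN = stream.map(x => ("buffer", x))
      .updateStateByKey[List[String]] { (newValues: Seq[String], state: Option[List[String]]) =>
        Some((state.getOrElse(Nil) ++ newValues).takeRight(N))
      }
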
2
votes
1 answer

Perform actions before the end of the micro-batch in Spark Streaming

Is there a way to perform some action at the end of each micro-batch of a DStream in Spark Streaming? My aim is to compute the number of events processed by Spark. Spark Streaming gives me some numbers, but the average also seems to…
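
foreachRDD runs once per micro-batch on the driver, which gives a per-batch hook; a sketch, assuming stream: DStream[String]:

    // Sketch: this block executes on the driver once per micro-batch.
    var totalEvents = 0L
    stream.foreachRDD { rdd =>
      val n = rdd.count() // forces the batch to be computed
      totalEvents += n
      println(s"batch done: $n events, $totalEvents total")
    }
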
2
votes
1 answer

Kafka directStream DStream map does not print

I have this simple Kafka stream: val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet) // Each Kafka message is a flight val flights = messages.map(_._2) flights.foreachRDD( rdd =>…
Sudheer Palyam
  • 2,499
  • 2
  • 23
  • 28
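
A frequent cause is laziness: DStream transformations execute only after an output operation is registered and the context is started. A sketch reusing flights and ssc from the question:

    // Sketch: register an output operation, then start the context.
    flights.print()        // output operation; map alone triggers nothing
    ssc.start()            // nothing runs before this
    ssc.awaitTermination()
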
2
votes
0 answers

How many RDDs are in the resulting DStream of reduceByKeyAndWindow?

I am currently working on a small Spark job to compute a stock correlation matrix from a DStream. From a DStream[(time, quote)], I need to aggregate quotes (Double) by time (Long) across multiple RDDs before computing correlations (considering all…
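
For reference, a windowed stream still yields one RDD per slide interval, and each RDD covers the whole window; a sketch assuming quotes: DStream[(Long, Double)]:

    import org.apache.spark.streaming.Seconds

    // Sketch: one RDD per 10 s slide, each aggregating the last 60 s of quotes.
    val windowed = quotes.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
    windowed.foreachRDD(rdd => println(s"rows in this window's RDD: ${rdd.count()}"))
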
2
votes
2 answers

Iterative algorithms with Spark streaming

So I understand that Spark can perform iterative algorithms on single RDDs, for example logistic regression: val points = spark.textFile(...).map(parsePoint).cache() var w = Vector.random(D) // current separating plane for (i <- 1 to…
user1893354
  • 5,778
  • 12
  • 46
  • 83
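
MLlib ships streaming variants of some iterative algorithms that refit the model on each micro-batch; a sketch with streaming logistic regression, assuming trainingData: DStream[LabeledPoint] and a numFeatures value:

    import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    // Sketch: the model's weights are updated as each micro-batch arrives.
    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures)) // numFeatures assumed
    model.trainOn(trainingData)
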
1
vote
0 answers

Lambda function to group by first substring?

I am trying to write a lambda function that groups words based on their first substring. The words are coming in like: a,word b,can a,eat c,vegetables b,if So far I have a lambda function combineddatardd.combineByKey(lambda v: [v], lambda…
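
The question is PySpark; in Scala (used for the other sketches) the grouping can be sketched by splitting on the first comma and grouping by the leading key, assuming lines: DStream[String]:

    // Sketch: "a,word" -> ("a", "word"), then group values under each key.
    val grouped = lines
      .map { line => val Array(k, v) = line.split(",", 2); (k, v) }
      .groupByKey()
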
1
vote
2 answers

Get Max & Min values for each key in the RDD

spark = SparkSession.builder.getOrCreate() sc = spark.sparkContext ssc = StreamingContext(sc, 10) rdd = ssc.sparkContext.parallelize(pd_binance) rdd.take(1) Here is a small portion of the result: [['0.02703300', '1.30900000'], ['0.02703300',…
Saif Nitham
  • 11
  • 1
  • 2
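
The question is PySpark; the same idea in Scala (a sketch assuming pairs: RDD[(String, Double)] parsed from the data above) tracks (min, max) per key in a single reduceByKey pass:

    // Sketch: seed each value as (v, v), then merge ranges per key.
    val minMax = pairs
      .mapValues(v => (v, v))
      .reduceByKey { case ((lo1, hi1), (lo2, hi2)) =>
        (math.min(lo1, lo2), math.max(hi1, hi2))
      }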