Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, and groupBy, to produce new datasets representing program outputs or intermediate state.
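As a point of reference, a minimal sketch of that model in Spark Streaming's Scala API (the host, port and 5-second batch interval are arbitrary choices): each interval of socket text becomes an RDD that is processed with ordinary map/reduce operations.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))      // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)   // one RDD per interval
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                 // deterministic batch computation
    counts.print()

    ssc.start()
    ssc.awaitTermination()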

109 questions
1
vote
0 answers

Reg: Parallelizing RDD partitions in Spark executors

I am new to Spark and trying out a sample Spark Kafka integration. What I have done is post JSONs from a single partitioned…
sunnydev
  • 11
  • 1
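The excerpt is truncated, but for the general Spark–Kafka DStream setup it describes, here is a hedged sketch using the kafka-0-10 direct stream (the broker address, topic, group id and the surrounding StreamingContext ssc are assumptions). Each Kafka partition becomes one partition of the batch RDD, which is what lets the executors process partitions in parallel.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "sample-group")

    // Each Kafka partition maps to one RDD partition per batch, so the JSON
    // records are processed in parallel across the executors.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("json-topic"), kafkaParams))

    stream.map(_.value).foreachRDD { rdd =>
      rdd.foreachPartition(records => records.foreach(println))
    }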
1
vote
1 answer

Spark's socket text stream is empty

I am following Spark's streaming guide. Instead of using nc -lk 9999, I have created my own simple Python server as follows. As can be seen from the code below, it will randomly generate the letters a through z. import socketserver import time from…
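On the Spark side this is plain socketTextStream usage; a frequent cause of an empty stream is a server that closes the connection after one request or does not terminate records with a newline. A minimal sketch (an existing SparkContext sc is assumed, the port matches the question):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    // Spark connects as a TCP client; the server must keep the connection open
    // and send UTF-8 text terminated by "\n", otherwise every batch is empty.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()
    ssc.start()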
1
vote
2 answers

Constructing window based on message timestamps in Spark DStream

I'm receiving a DStream from Kafka and I want to group all messages in some sliding window by keys. The point is that this window needs to be based on the timestamps provided in each message (a separate field): Message…
Developer87
  • 2,448
  • 4
  • 23
  • 43
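DStream windows (window, reduceByKeyAndWindow and friends) are driven by processing time, not by a timestamp carried inside the message, so event-time grouping has to be built by hand. A hedged sketch, where the Message schema, the DStream named messages and the one-minute bucket size are all assumptions: bucket each record by its own timestamp and group on (key, bucket), with the processing-time window only bounding how long a bucket can still receive records.

    case class Message(key: String, ts: Long, payload: String)   // assumed schema, ts in epoch millis

    val bucketMs = 60000L                                         // one-minute event-time buckets
    val byEventTime = messages.map(m => ((m.key, m.ts / bucketMs), m))

    // The processing-time window only controls how long late records can still
    // land in an open bucket; the grouping key carries the event time.
    val grouped = byEventTime.groupByKeyAndWindow(Seconds(120), Seconds(30))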
1
vote
1 answer

Kafka - Spark Streaming Integration: DStreams and Task reuse

I am trying to understand the internals of Spark Streaming (not Structured Streaming), specifically the way tasks see the DStream. I am going over the source code of Spark in Scala, here. I understand the call stack: ExecutorCoarseGrainedBackend…
1
vote
1 answer

Spark QueueStream never exhausted

Puzzled on a piece of code I borrowed from the internet for research purposes. This is the code: import org.apache.spark.sql.SparkSession import org.apache.spark.rdd.RDD import org.apache.spark.streaming.{Seconds, StreamingContext} import…
thebluephantom
  • 16,458
  • 8
  • 40
  • 83
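For context, queueStream never signals end of input: once the queue is drained, the stream simply produces empty RDDs every batch interval, which is why such a job appears to run forever. A small sketch (sc and ssc are assumed to exist):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    val rddQueue = new mutable.Queue[RDD[Int]]()
    rddQueue += sc.parallelize(1 to 100)

    // oneAtATime = true dequeues one RDD per batch; once the queue is empty,
    // later batches just contain empty RDDs -- the stream is never "exhausted".
    val input = ssc.queueStream(rddQueue, oneAtATime = true)
    input.count().print()
    ssc.start()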
1
vote
1 answer

Join Dstream[Document] and Rdd by key Spark Scala

Here is my code: ssc = new StreamingContext(sparkContext, Seconds(time)) spark = SparkSession.builder.config(properties).getOrCreate() val Dstream1: ReceiverInputDStream[Document] = ssc.receiverStream(properties) // Dstream1 has Id1 and other…
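The usual way to join a DStream with a static RDD is transform, which exposes each batch as an RDD; a hedged sketch assuming both sides have already been keyed by the shared id (the names keyedStream, staticRdd and Meta are placeholders):

    // keyedStream: DStream[(String, Document)], staticRdd: RDD[(String, Meta)] -- assumed types
    val joined = keyedStream.transform { batchRdd =>
      batchRdd.join(staticRdd)    // per-batch RDD-to-RDD join by key
    }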
1
vote
1 answer

Using Map in PySpark to parse and assign column names

Here is what I am trying to do. The input data looks like this (tab-separated): 12/01/2018 user1 123.123.222.111 23.3s 12/01/2018 user2 123.123.222.116 21.1s The data is coming in through Kafka and is being parsed with the following…
steven
  • 644
  • 1
  • 11
  • 23
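The question is about PySpark, but the approach is the same in either API: split each tab-separated line, map it onto a named record, and turn the batch into a DataFrame. A Scala sketch with assumed column names and an existing SparkSession called spark:

    case class Access(date: String, user: String, ip: String, duration: String)  // assumed columns

    lines.foreachRDD { rdd =>
      import spark.implicits._
      val df = rdd.map(_.split("\t"))
        .collect { case Array(date, user, ip, duration) => Access(date, user, ip, duration) }
        .toDF()
      df.show()
    }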
1
vote
1 answer

How to merge multiple DStreams in spark using scala?

I have three incoming streams from Kafka. I parse the streams received as JSON and extract them to appropriate case classes and form DStreams of the following schema: case class Class1(incident_id: String, crt_object_id: String, …
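If the parsed streams end up with the same element type, merging them is just union; a sketch (the stream names are placeholders, and streams of different case classes have to be mapped to a common type first):

    // Element types must match; map Class1/Class2/Class3 to a shared type if they differ.
    val merged = stream1.union(stream2).union(stream3)

    // Equivalent form via the StreamingContext:
    val mergedAll = ssc.union(Seq(stream1, stream2, stream3))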
1
vote
1 answer

pyspark: train kmeans streaming with data retrieved from kafka

I want to train a streaming k-means model with data consumed from a Kafka topic. My problem is how to present the data to the streaming k-means model sc = SparkContext(appName="PythonStreamingKafka") ssc = StreamingContext(sc, 30) zkQuorum, topic =…
severine
  • 305
  • 1
  • 3
  • 11
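The question uses PySpark, but the shape of the answer is the same: parse each Kafka record into an mllib Vector and feed the resulting DStream to StreamingKMeans.trainOn. A Scala sketch where the record format (comma-separated numbers), k and the vector dimension are assumptions:

    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    // kafkaStream: DStream of text records such as "1.0,2.0" -- format is assumed
    val training = kafkaStream.map(line => Vectors.dense(line.split(',').map(_.toDouble)))

    val model = new StreamingKMeans()
      .setK(3)                       // assumed number of clusters
      .setDecayFactor(1.0)
      .setRandomCenters(2, 0.0)      // dimension must match the parsed vectors

    model.trainOn(training)          // cluster centers are updated on every batch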
1
vote
1 answer

Apache Spark streaming - Timeout long-running batch

I'm setting up an Apache Spark long-running streaming job to perform (non-parallelized) streaming using InputDStream. What I'm trying to achieve is that when a batch on the queue takes too long (based on a user-defined timeout), I want to be able to…
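Spark Streaming has no built-in per-batch timeout, so one possible workaround (sketched here; the timeout value and the process function are placeholders) is to tag each batch's jobs with a job group and cancel that group from a watchdog thread:

    stream.foreachRDD { rdd =>
      val sc = rdd.sparkContext
      val groupId = s"batch-${System.currentTimeMillis()}"
      sc.setJobGroup(groupId, "timed batch", interruptOnCancel = true)

      // Watchdog: cancel this batch's jobs if they run past the timeout.
      val watchdog = new Thread(() => {
        Thread.sleep(timeoutMs)               // user-defined timeout in ms (placeholder)
        sc.cancelJobGroup(groupId)
      })
      watchdog.setDaemon(true)
      watchdog.start()

      rdd.foreach(record => process(record))  // the potentially long-running work
    }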
1
vote
1 answer

Not able to persist the DStream for use in next batch

JavaRDD<…> history_ = sc.emptyRDD(); java.util.Queue<JavaRDD<…>> queue = new LinkedList<JavaRDD<…>>(); queue.add(history_); JavaDStream<…> history_dstream = ssc.queueStream(queue); JavaPairDStream<…>…
JSR29
  • 354
  • 1
  • 5
  • 17
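Rather than threading an RDD through a queue, the usual way to carry data from one batch into the next is keyed state, for example updateStateByKey (which requires a checkpoint directory). A hedged Scala sketch over an assumed DStream of (key, count) pairs:

    ssc.checkpoint("/tmp/streaming-checkpoint")   // required for stateful ops; path is a placeholder

    // pairs: DStream[(String, Long)] -- the running total survives from batch to batch
    val history = pairs.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newValues.sum)
    }
    history.print()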
1
vote
1 answer

Scala Spark : trying to avoid type erasure when using overload

I'm relatively new to Scala/Spark. I'm trying to overload one function depending on the class type of the DStream: def persist(service1DStream: DStream[Service1]): Unit = {...} def persist(service2DStream: DStream[Service2]): Unit = {...} I'm getting…
Fares
  • 605
  • 4
  • 19
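Both overloads erase to persist(DStream), which is why they clash. One common workaround is to give the second overload an extra implicit DummyImplicit parameter so its erased signature differs; a sketch (the method bodies are placeholders):

    import org.apache.spark.streaming.dstream.DStream

    def persist(stream: DStream[Service1]): Unit = {
      // Service1-specific persistence goes here
    }

    // The extra implicit parameter changes the erased signature, so both overloads can coexist.
    def persist(stream: DStream[Service2])(implicit d: DummyImplicit): Unit = {
      // Service2-specific persistence goes here
    }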
1
vote
1 answer

Scala - Spark Dstream operation similar to Cbind in R

1) I am trying to use MLlib Random Forest. My final output should have 2 columns: id, predicted_value (e.g. 1, 0.5 and 2, 0.4). My feature sets are the training and scoring data (train, score), but when I train and score I drop the id field as it could…
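There is no cbind with guaranteed row alignment on RDDs, so the safer pattern is to keep the id next to the feature vector and predict per record, which keeps the id and the prediction paired. A hedged mllib sketch (the (id, features) layout is assumed):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.RandomForestModel
    import org.apache.spark.rdd.RDD

    // scoreData keeps the id alongside the features instead of dropping it.
    def scoreWithIds(model: RandomForestModel,
                     scoreData: RDD[(String, Vector)]): RDD[(String, Double)] =
      scoreData.map { case (id, features) => (id, model.predict(features)) }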
1
vote
1 answer

Spark streaming JavaPairDStream to text file

I am quite new to Spark Streaming, and I am stuck saving my output. My question is, how can I save the output of my JavaPairDStream in a text file, which is updated for each file only with the elements inside the DStream? For example, with the…
Luis_MG
  • 65
  • 7
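For reference, the DStream API writes one output directory per batch rather than appending to a single file. A Scala sketch of both the built-in saveAsTextFiles and a foreachRDD variant (the output paths are placeholders, and pairs is an assumed key-value DStream):

    // Built-in: creates a directory named <prefix>-<batchTime>.<suffix> for every batch.
    pairs.saveAsTextFiles("hdfs:///output/pairs", "txt")

    // Manual control: write only non-empty batches, one directory per batch time.
    pairs.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) rdd.saveAsTextFile(s"hdfs:///output/pairs-${time.milliseconds}")
    }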
1
vote
1 answer

Spark streaming reduce by multiple key Java

I am quite new to Spark Streaming and I am getting stuck trying to figure out how to handle this problem, since I found a lot of examples for single (K,V) pairs but nothing further. I would appreciate some help in order to find the best approach…
Luis_MG
  • 65
  • 7
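The standard trick is to make the key itself a tuple, so reduceByKey (or reduceByKeyAndWindow) aggregates on the whole composite key. A sketch with an assumed events DStream and assumed field names:

    // Composite (country, device) key -- both fields participate in the grouping.
    val byCompositeKey = events.map(e => ((e.country, e.device), 1L))

    // Counts per (country, device) within each batch...
    val counts = byCompositeKey.reduceByKey(_ + _)

    // ...or across a sliding window.
    val windowed = byCompositeKey.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))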