Questions tagged [dstream]

Discretized Streams (D-Stream) is an approach that handles streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce and groupBy, to produce new datasets representing program outputs or intermediate state.
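A minimal sketch of that model using the Spark Streaming DStream API (the socket source, host, port, and batch interval are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    // Each 10-second interval of input becomes one deterministic batch (an RDD).
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // The input received during an interval forms that interval's dataset.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Deterministic parallel operations (map, reduce) produce output datasets.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```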

109 questions
0
votes
0 answers

How to set up a Python feeder to connect to a Spark DStream

This is more of a conceptual question at this point. As part of a class assignment, we are given a Python script that writes data to stdout. The first step in the assignment is to set up this Python feeder to feed into a Spark DStream. I've been…
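Not an official recipe, but a common pattern for this kind of assignment is to pipe the script's stdout into a TCP socket (for example `python feeder.py | nc -lk 9999`) and point a socket stream at it; the host, port, and batch interval below are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("PythonFeeder")
val ssc  = new StreamingContext(conf, Seconds(5))

// Reads whatever the feeder writes to the socket, one line per element.
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()

ssc.start()
ssc.awaitTermination()
```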
0
votes
0 answers

Kafka offset management in Spark Streaming (DStream) with Hive as a sink

We want to achieve Kafka offset management in Spark DStream with Hive as a sink, instead of a NoSQL DB such as HBase or MongoDB. Could you please suggest a high-level solution to implement this functionality? Thanks in advance.
Rohan
  • 3
  • 2
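One possible shape for this, sketched rather than complete: read each batch's offsets with HasOffsetRanges, write the batch and its offsets to Hive together, and resume from the stored offsets on restart. `stream` is assumed to come from KafkaUtils.createDirectStream, and the Hive writes are left as comments:

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // The direct stream exposes the Kafka offsets covered by this batch.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. Write the batch's data to the Hive sink (app-specific, e.g. via
  //    a DataFrame and insertInto on the target table).
  // 2. Persist offsetRanges (topic, partition, fromOffset, untilOffset)
  //    to a Hive offsets table in the same unit of work, so a restart
  //    can read the last stored offsets and resume from there.

  // 3. Optionally also commit back to Kafka once both writes succeed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```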
0
votes
1 answer

How to generate a DStream from RateStreamSource in Spark

I have a case class in Scala like this: case class RemoteCopyGroup(ts: Long, systemId: String, name: String, id: Int, role: String, mode: String, remoteGroupName: String) object RemoteCopyGroup { // to be removed val…
user9920500
  • 606
  • 7
  • 21
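RateStreamSource belongs to Structured Streaming, so there is no direct DStream equivalent; a rough stand-in on the DStream side is queueStream, feeding one generated RDD per batch. The feeder loop and record values here are hypothetical, and `ssc`/`sc` are assumed to be in scope:

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// One synthetic RDD is dequeued per batch interval.
val queue   = mutable.Queue[RDD[RemoteCopyGroup]]()
val dstream = ssc.queueStream(queue)
dstream.print()

// A separate thread (or driver loop) enqueues records at the desired rate,
// playing the role RateStreamSource has in Structured Streaming:
// queue += sc.parallelize(Seq(
//   RemoteCopyGroup(System.currentTimeMillis(), "sys1", "name", 1, "role", "mode", "rg")))
```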
0
votes
0 answers

Why can I convert a DStream[String] to DStream[List[String]] but not to DStream[DataFrame]?

My question is about DStream handling in legacy Spark Streaming. I would like to know why, when I convert a DStream[String] to a DStream[List[String]], everything is OK, but when I try to convert this generated list to a DataFrame using toDF()…
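The usual explanation: DStream elements must be plain serializable values spread across an RDD, whereas a DataFrame describes an entire distributed dataset, so a DStream[DataFrame] cannot exist; each batch's RDD is converted instead, inside foreachRDD (or transform). A minimal sketch, with a placeholder column name:

```scala
import org.apache.spark.sql.SparkSession

dstream.foreachRDD { rdd =>
  // Obtain (or reuse) a session on the driver for this batch.
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // toDF applies per batch RDD, not to the DStream itself.
  val df = rdd.toDF("line")
  df.show()
}
```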
0
votes
1 answer

Count number of elements in each PySpark DStream

I'm looking for a way to count the number of elements (or the number of RDDs) that I receive in each batch of the DStream I have created in PySpark. If you know a way that could help me, I would be pleased. Thanks.
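Two straightforward options, shown in Scala for consistency with the other sketches (the PySpark methods have the same names):

```scala
// Emits a single-element stream carrying each batch's element count.
dstream.count().print()

// Or count inside foreachRDD when the number is needed programmatically;
// each batch interval corresponds to exactly one RDD.
dstream.foreachRDD { rdd =>
  println(s"Batch contains ${rdd.count()} elements")
}
```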
0
votes
0 answers

ConstantInputDStream.print() does nothing

I'm trying to get a simple DStream to print, but with no success. See the code below. I'm using a Databricks notebook in Azure. import org.apache.spark.streaming.{ StreamingContext, Seconds } val ssc = new StreamingContext(sc, batchDuration =…
Koenig Lear
  • 2,366
  • 1
  • 14
  • 29
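A frequent cause is that the context was never started, or that print() wrote to the driver's stdout log rather than the notebook cell output. A sketch that does produce output, assuming the notebook's `sc`:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

val ssc    = new StreamingContext(sc, Seconds(5))
val stream = new ConstantInputDStream(ssc, sc.parallelize(1 to 10))
stream.print()

ssc.start()                            // no output operation runs before this
ssc.awaitTerminationOrTimeout(30000L)  // bound the wait in a notebook cell
ssc.stop(stopSparkContext = false)
```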
0
votes
1 answer

Spark never stops processing first batch

I am trying to run an application I found on GitHub, this one: https://github.com/CSIRT-MU/AIDA-Framework I am running it in an Ubuntu 18.04.1 virtual machine. At some point in its data processing pipeline it uses Spark, and it seems to get stuck at…
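Without knowing this framework's internals, a guess: the classic reason a DStream job never finishes its first batch is that receivers occupy every available core, leaving none for processing. The usual fix is to grant at least one more core than there are receivers:

```scala
import org.apache.spark.SparkConf

// With N receiver-based inputs, allocate at least N + 1 cores; with a
// single receiver in local mode, "local[1]" would starve processing:
val conf = new SparkConf().setMaster("local[2]").setAppName("aida")
```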
0
votes
1 answer

DStream to RDD in Spark Streaming

I have a DStream[(String, String)] and I need to convert it to an RDD[(String, String)]. Is there any way to do it? I need to do this using Scala. Thanks in advance!
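A DStream is a sequence of RDDs, one per batch interval, so there is no single RDD to convert to; the batch RDDs are accessed directly. A sketch, assuming the pair stream above:

```scala
// Access each batch's RDD as it arrives:
dstream.foreachRDD { rdd =>
  // rdd: RDD[(String, String)] for this batch interval
  rdd.take(5).foreach(println)
}

// Or derive a new DStream by rewriting every batch's RDD:
val nonEmptyKeys = dstream.transform(rdd => rdd.filter { case (k, _) => k.nonEmpty })
```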
0
votes
1 answer

How to gracefully stop a Spark DStream process

I'm trying to read data from a Kafka stream, process it, and save it to a report. I'd like to run this job once a day. I'm using DStreams. Is there an equivalent of trigger(Trigger.Once) in DStreams I could use for this scenario? Appreciate…
Anil
  • 420
  • 2
  • 16
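DStreams have no Trigger.Once; a rough equivalent for a once-a-day run is to start the context, wait until the day's work is drained, then stop gracefully so in-flight batches finish. The timeout value is a placeholder:

```scala
ssc.start()

// Run for a bounded period (or poll an external flag / marker file instead).
ssc.awaitTerminationOrTimeout(1000L * 60 * 60)

// Let queued batches complete before shutting down.
ssc.stop(stopSparkContext = true, stopGracefully = true)
```

Setting spark.streaming.stopGracefullyOnShutdown=true makes an external shutdown signal behave the same graceful way.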
0
votes
1 answer

How to retrieve location when streaming Twitter data using PySpark

I am working on streaming tweets using PySpark in real time. I want to retrieve the text, location, and username. Currently, I am receiving the tweet text only. Is there any way to get the location also? lines = ssc.socketTextStream("localhost", 5550) I'm…
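The socket stream only carries whatever the feeder writes, so the usual fix is to have the feeder emit the full tweet JSON per line and parse the wanted fields on the Spark side. A Scala sketch using json4s (bundled with Spark); the field paths assume the classic Twitter v1.1 payload:

```scala
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

val lines = ssc.socketTextStream("localhost", 5550)

val tweets = lines.map { line =>
  implicit val formats: Formats = DefaultFormats
  val json = parse(line)
  val text = (json \ "text").extract[String]
  val user = (json \ "user" \ "screen_name").extract[String]
  val loc  = (json \ "user" \ "location").extractOpt[String].getOrElse("")
  (user, loc, text)
}
tweets.print()
```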
0
votes
1 answer

Best practice for key with two values

So far I have a JavaDStream whose rows first looked like this (one value per line): a,apple,spain; b,orange,italy; c,apple,italy; a,apple,italy; a,orange,greece. First I split up the rows and mapped them to a key-value pair in a…
wdmv1981
  • 49
  • 7
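A common shape for this, sketched in Scala for rows like a,apple,spain (`lines` is assumed to be a DStream[String]): key by the first field and keep the remaining two as a tuple value, after which the pair-stream operations apply:

```scala
// ("a", ("apple", "spain")), ("b", ("orange", "italy")), ...
val pairs = lines.map { row =>
  val Array(key, fruit, country) = row.split(",")
  (key, (fruit, country))
}

// Aggregate over the tuple values, e.g. gather all (fruit, country) per key:
val grouped = pairs.groupByKey()
grouped.print()
```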
0
votes
1 answer

Case Class within foreachRDD causes Serialization Error

I can create a DF inside foreachRDD if I do not try and use a case class and simply let default names for columns be made with toDF(), or if I assign them via toDF("c1", "c2"). As soon as I try and use a case class, and having looked at the…
thebluephantom
  • 16,458
  • 8
  • 40
  • 83
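The usual remedy is to declare the case class at the top level of the file, not inside a method or the foreachRDD closure, so Spark can resolve its TypeTag without dragging the enclosing scope into the serialized closure. A sketch with hypothetical fields:

```scala
import org.apache.spark.sql.SparkSession

// Top-level definition: not nested in a method or in the closure itself.
case class Record(c1: String, c2: String)

dstream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  val df = rdd.map { line =>
    val Array(a, b) = line.split(",")
    Record(a, b)
  }.toDF()
  df.show()
}
```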
0
votes
1 answer

Scala Spark Streaming unit test with spark-testing-base throws error

I was trying to run a unit test on my Spark Streaming code with spark-testing-base, and I am having trouble running their sample code. Here is the code snippet I copied: import com.holdenkarau.spark.testing.SharedSparkContext import…
Man-Kit Yau
  • 149
  • 1
  • 10
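For comparison, the library's documented streaming pattern looks roughly like this (the exact ScalaTest base class depends on your ScalaTest and spark-testing-base versions):

```scala
import com.holdenkarau.spark.testing.StreamingSuiteBase
import org.apache.spark.streaming.dstream.DStream
import org.scalatest.funsuite.AnyFunSuite

class TokenizeTest extends AnyFunSuite with StreamingSuiteBase {

  // The operation under test: DStream in, DStream out.
  def tokenize(lines: DStream[String]): DStream[String] =
    lines.flatMap(_.split(" "))

  test("tokenize splits each line into words") {
    val input    = List(List("hello world"))
    val expected = List(List("hello", "world"))
    testOperation[String, String](input, tokenize _, expected, ordered = false)
  }
}
```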
0
votes
1 answer

Spark Streaming Run Actions On DStream Asynchronously

I'm writing a program for data ingestion: read from Kafka into a DStream, split the DStream into 3 streams, and execute actions on each one: val stream = createSparkStream(Globals.configs, ssc) val s1 = stream.filter() val s2 =…
Alex
  • 111
  • 10
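Worth knowing here: each foreachRDD/print is a separate output operation, and by default Spark Streaming runs a batch's output operations one after another. A sketch of the layout (predicates are placeholders) plus the undocumented but widely used knob that lets them run in parallel:

```scala
val s1 = stream.filter(r => /* predicate 1, app-specific */ true)
val s2 = stream.filter(r => /* predicate 2 */ true)
val s3 = stream.filter(r => /* predicate 3 */ true)

// Three output operations; sequential per batch unless configured otherwise.
s1.foreachRDD(rdd => rdd.count())
s2.foreachRDD(rdd => rdd.count())
s3.foreachRDD(rdd => rdd.count())

// spark-submit --conf spark.streaming.concurrentJobs=3 ...
// lets the scheduler run a batch's three jobs concurrently.
```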
0
votes
1 answer

PySpark updateStateByKey fails when calling my function

I'm just trying to run the sample code for stateful streaming, but it fails with an error, and I cannot work out why it occurs. Spark 2.3 with Python 3.6 on Cloudera VM 5.13.3. Running options: --master local[*] --queue PyCharmSpark pyspark-shell My code is: from…
Dipas
  • 294
  • 2
  • 9
  • 21
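The most common cause of updateStateByKey failures is a missing checkpoint directory, which is mandatory for stateful operations. A Scala sketch of the shape (the question uses PySpark, where ssc.checkpoint(...) is required the same way; the path and input stream are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
ssc.checkpoint("/tmp/streaming-checkpoint")  // mandatory for stateful ops

// Running count per key across batches.
def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.map((_, 1)).updateStateByKey(updateCount)
counts.print()

ssc.start()
ssc.awaitTermination()
```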