Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Since version 1.3.0, it has supported exactly-once processing semantics, even in the face of failures.

5565 questions
2
votes
0 answers

Spark Streaming with large messages java.lang.OutOfMemoryError: Java heap space

I am using Spark Streaming 1.6.1 with Kafka 0.9.0.1 (createStream API) on HDP 2.4.2. My use case sends large messages, ranging from 5 MB to 30 MB, to Kafka topics; in these cases Spark Streaming fails to complete its job and crashes with the exception below. I am…
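A common cause with messages this large is the consumer's default fetch size, which is well below 30 MB. A minimal sketch of a receiver-based setup that raises the fetch limit and spills received blocks to disk instead of holding them on-heap; the ZooKeeper host, group id, and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LargeMessageStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("large-message-stream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Raise the consumer fetch size above the largest expected message
    // (30 MB here); the 0.8-style consumer otherwise cannot pull the record.
    val kafkaParams = Map[String, String](
      "zookeeper.connect" -> "zk-host:2181",              // assumption: your ZK quorum
      "group.id" -> "large-message-group",
      "fetch.message.max.bytes" -> (32 * 1024 * 1024).toString
    )

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("big-topic" -> 1),            // assumption: topic name
      StorageLevel.MEMORY_AND_DISK_SER                    // spill blocks to disk rather than OOM
    )

    stream.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```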
2
votes
2 answers

DStream checkpointing has been enabled but the DStreams with their functions are not serializable

I want to send a DStream to Kafka, but it still doesn't work. searchWordCountsDStream.foreachRDD(rdd => rdd.foreachPartition( partitionOfRecords => { val props = new HashMap[String, Object]() …
Kof
  • 65
  • 2
  • 5
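The usual cause of this error is the Kafka producer (or another non-serializable object) being captured in a DStream closure that checkpointing then tries to serialize. A minimal sketch of the standard workaround, creating the producer inside foreachPartition so it is instantiated on the executor and never leaves it; the broker address and topic name are placeholders:

```scala
import java.util.{HashMap => JHashMap}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

def writeToKafka(dstream: DStream[(String, Long)]): Unit = {
  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      // Built here, on the executor, so nothing non-serializable is
      // captured in the checkpointed closure.
      val props = new JHashMap[String, Object]()
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092") // assumption
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringSerializer")

      val producer = new KafkaProducer[String, String](props)
      partitionOfRecords.foreach { case (word, count) =>
        producer.send(new ProducerRecord[String, String]("counts", word, count.toString))
      }
      producer.close()
    }
  }
}
```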
2
votes
2 answers

Zeppelin Twitter Streaming Example Not Working

I am trying to run the Twitter streaming example in Zeppelin. After searching around, I added "org.apache.bahir:spark-streaming-twitter_2.11:2.0.0" to the Spark interpreter. With that I can make the first part work, as in: Apache Zeppelin 0.6.1: Run Spark 2.0…
user1828513
  • 367
  • 2
  • 7
  • 16
2
votes
1 answer

RDD toDF(): Erroneous Behavior

I built a Spark Streaming app that fetches content from a Kafka queue and intends to put the data into a MySQL table after some pre-processing and structuring. I call the 'foreachRDD' method on the DStream. The issue that I'm facing is…
arshellium
  • 215
  • 1
  • 6
  • 17
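toDF() misbehaving inside foreachRDD is usually a SQLContext/implicits scoping issue. The Spark Streaming guide recommends a lazily instantiated singleton SQLContext; a minimal sketch of that pattern, with a hypothetical Record case class standing in for the real schema:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

// Lazily instantiated singleton, so toDF() works reliably inside
// foreachRDD even after recovery from a checkpoint.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sc: SparkContext): SQLContext = {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

case class Record(word: String, count: Long)

def saveToMySql(dstream: DStream[(String, Long)]): Unit = {
  dstream.foreachRDD { rdd =>
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    import sqlContext.implicits._                         // required for toDF()
    val df = rdd.map { case (w, c) => Record(w, c) }.toDF()
    // write df to MySQL here, e.g. via df.write.jdbc(...)
  }
}
```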
2
votes
0 answers

Share a variable across Spark streams

How can I share variables across Spark streams in PySpark? I'm trying to share a dataframe that holds various values for a combination of features, like platform, etc. The program works once, when the global variable is first initialized. It…
dvshekar
  • 93
  • 11
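The question is PySpark, but the pattern is the same in any Spark language: hold the shared lookup in a broadcast variable and re-broadcast when it changes, rather than relying on a global that is only shipped to executors once. A hedged Scala sketch of that pattern; FeatureLookup, loadLookup, and the map contents are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

// Executors always read the lookup through the broadcast; the driver can
// refresh it between batches instead of mutating a global.
object FeatureLookup {
  @volatile private var instance: Broadcast[Map[String, Double]] = _

  def get(sc: SparkContext): Broadcast[Map[String, Double]] = {
    if (instance == null) synchronized {
      if (instance == null) instance = sc.broadcast(loadLookup())
    }
    instance
  }

  def refresh(sc: SparkContext): Unit = synchronized {
    if (instance != null) instance.unpersist()
    instance = sc.broadcast(loadLookup())
  }

  // hypothetical loader for the platform/feature values
  private def loadLookup(): Map[String, Double] = Map("platform-a" -> 1.0)
}

def score(dstream: DStream[String]): DStream[(String, Double)] =
  dstream.transform { rdd =>
    val lookup = FeatureLookup.get(rdd.sparkContext)
    rdd.map(key => (key, lookup.value.getOrElse(key, 0.0)))
  }
```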
2
votes
1 answer

Flow time stamp through streaming functions

How, if at all, is it possible to generate a random number or obtain the system time each time a batch is run with Spark Streaming? I have two functions which process a batch of messages: 1 - the first processes the key, creates a file (CSV) and writes headers; 2 -…
Ken Alton
  • 686
  • 1
  • 9
  • 21
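foreachRDD has an overload that receives the batch Time, which gives both functions one consistent timestamp per batch. A small sketch; the output path is purely illustrative:

```scala
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Each batch gets a single timestamp derived from its scheduled Time,
// here used to name a per-batch output directory of part files.
def writeBatches(dstream: DStream[String]): Unit = {
  val fmt = new SimpleDateFormat("yyyyMMdd-HHmmss")
  dstream.foreachRDD { (rdd, time: Time) =>
    val stamp = fmt.format(new Date(time.milliseconds))
    val path = s"/tmp/batches/batch-$stamp"   // hypothetical output location
    if (!rdd.isEmpty()) rdd.saveAsTextFile(path)
  }
}
```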
2
votes
1 answer

Spark UI's kill is not killing Driver

I am trying to kill my Spark-Kafka streaming job from the Spark UI. It is able to kill the application, but the driver is still running. Can anyone help me with this? I am fine with my other streaming jobs; only one of the streaming jobs is giving this…
AKC
  • 953
  • 4
  • 17
  • 46
2
votes
1 answer

How to join two (or more) streams (JavaDStream) in Apache Spark

We have a Spark Streaming application that consumes the Gnip compliance stream. In the old version of the API, the compliance stream was provided by one endpoint, but now it is provided by 8 different endpoints. We could run the same Spark application…
Fanooos
  • 2,718
  • 5
  • 31
  • 55
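For merging several streams of the same element type into one pipeline, StreamingContext.union is the usual answer. A sketch, using socketTextStream as a stand-in for whatever receiver wraps each compliance endpoint:

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// One DStream per endpoint, all unioned into a single stream so the
// downstream processing logic stays unchanged.
def mergeEndpoints(ssc: StreamingContext, hosts: Seq[(String, Int)]): DStream[String] = {
  val streams: Seq[DStream[String]] =
    hosts.map { case (host, port) => ssc.socketTextStream(host, port) }
  ssc.union(streams)
}
```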
2
votes
1 answer

Spark Memory/worker issues & what is the correct spark configuration?

I have a total of 6 nodes in my Spark cluster. 5 nodes have 4 cores and 32 GB RAM each, and one node (node 4) has 8 cores and 32 GB RAM. So I have a total of 6 nodes: 28 cores and 192 GB RAM. (I want to use half of the memory, but all the cores.) Planning to…
AKC
  • 953
  • 4
  • 17
  • 46
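One way to reason about "half the memory, all the cores" on this cluster: size executors so every core is claimed while executor memory sums to roughly 96 GB. The numbers below are one possible split under those assumptions, not a definitive recommendation:

```scala
import org.apache.spark.SparkConf

// 4-core executors: the five small nodes host one each, node 4 hosts two,
// giving 7 executors x 4 cores = all 28 cores. 7 x 14 GB = 98 GB, roughly
// half of the 192 GB total; node 4's two executors use 28 GB of its 32 GB.
val conf = new SparkConf()
  .setAppName("sized-streaming-app")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "14g")
  .set("spark.cores.max", "28")   // cap at the cluster's 28 cores (standalone mode)
```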
2
votes
1 answer

Reading files dynamically from HDFS from within spark transformation functions

How can a file from HDFS be read in a Spark function without using sparkContext within the function? Example: val filedata_rdd = rdd.map { x => ReadFromHDFS(x.getFilePath) } The question is how ReadFromHDFS can be implemented. Usually, to read from HDFS we…
darkknight444
  • 546
  • 8
  • 21
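Executors cannot use the SparkContext, but they can reach HDFS directly through Hadoop's FileSystem API. A sketch of one way ReadFromHDFS could be implemented, assuming the Hadoop configuration (and so the HDFS address) is on the executor classpath:

```scala
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// mapPartitions creates one FileSystem handle per partition instead of
// one per record; each element is then read straight from HDFS.
def readFiles(rdd: RDD[String]): RDD[(String, String)] =
  rdd.mapPartitions { paths =>
    val fs = FileSystem.get(new Configuration())  // picks up hdfs-site.xml on the classpath
    paths.map { p =>
      val reader = new BufferedReader(new InputStreamReader(fs.open(new Path(p))))
      val content = Iterator.continually(reader.readLine()).takeWhile(_ != null).mkString("\n")
      reader.close()
      (p, content)
    }
  }
```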
2
votes
0 answers

Reading from Kafka 0.8.1 and writing to Kafka 0.9.0

I have a requirement where I should read messages from Kafka v0.8.1 (in cluster A) and write them to Kafka v0.9.0 (in cluster B). I am using Spark Streaming to read from Kafka A and push messages into Kafka B using Spark's native Kafka classes. It is giving…
2
votes
0 answers

Use spark-streaming as a scheduler

I have a Spark job that reads from an Oracle table into a dataframe. The jdbc.read method seems to pull an entire table in at once, so I constructed a spark-submit job to work in batch. Whenever I have data I need manipulated, I…
tadamhicks
  • 905
  • 1
  • 14
  • 34
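If the concern is read.jdbc pulling the whole table through one connection, the partitioned overload splits the scan into parallel range queries over a numeric column. A sketch with hypothetical connection details and split column:

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SQLContext}

// Spark issues numPartitions concurrent queries, each covering a slice of
// the column's [lowerBound, upperBound] range, instead of one full scan.
def loadOracleTable(sqlContext: SQLContext): DataFrame = {
  val props = new Properties()
  props.setProperty("user", "scott")          // hypothetical credentials
  props.setProperty("password", "tiger")
  props.setProperty("driver", "oracle.jdbc.OracleDriver")

  sqlContext.read.jdbc(
    "jdbc:oracle:thin:@//db-host:1521/ORCL",  // url (assumption)
    "SOURCE_TABLE",                           // table
    "ID",                                     // numeric column to partition on
    1L,                                       // lowerBound
    1000000L,                                 // upperBound
    8,                                        // numPartitions (parallel range queries)
    props)
}
```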
2
votes
3 answers

Storing a DataFrame to a Hive partitioned table in Spark

I'm trying to store a stream of data coming in from a Kafka topic into a Hive partitioned table. I was able to convert the DStream to a dataframe and created a HiveContext. My code looks like this: val hiveContext = new…
Riyan Mohammed
  • 247
  • 2
  • 6
  • 20
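One way to land each micro-batch in a Hive-partitioned table is DataFrameWriter.partitionBy, which lays files out per partition column. A minimal sketch, assuming the DataFrame was created through a HiveContext; the table and partition column names are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Appends the batch to a Hive-managed table, writing one directory per
// distinct value of the "dt" partition column.
def storeBatch(df: DataFrame): Unit = {
  df.write
    .mode(SaveMode.Append)
    .partitionBy("dt")          // the Hive partition column
    .saveAsTable("events")      // requires a HiveContext-backed DataFrame
}
```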
2
votes
1 answer

Spark Error: invalid log directory /app/spark/spark-1.6.1-bin-hadoop2.6/work/app-20161018015113-0000/3/

My Spark application is failing with the above error. Actually, my Spark program is writing the logs to that directory; both stderr and stdout are being written on all the workers. My program used to work fine earlier, but yesterday I changed the…
AKC
  • 953
  • 4
  • 17
  • 46
2
votes
1 answer

Why can Spark not recover from a checkpoint using getOrCreate?

Following the official doc, I'm trying to recover a StreamingContext: def get_or_create_ssc(): cfg = SparkConf().setAppName('MyApp').setMaster('local[10]') sc = SparkContext(conf=cfg) ssc = StreamingContext(sparkContext=sc, batchDuration=2) …
Zhang Tong
  • 4,569
  • 3
  • 19
  • 38
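getOrCreate only recovers cleanly when the factory function sets the same checkpoint directory itself and builds the entire DStream graph before returning. The question is PySpark, but the pattern is identical; a Scala sketch, with the checkpoint path and sample stream purely illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Assumption: in production this should be a durable path, e.g. on HDFS.
  val checkpointDir = "/tmp/myapp-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[10]")
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint(checkpointDir)   // must be set inside the factory

    // The whole DStream graph must be defined here, before returning ssc;
    // defining streams outside the factory is the usual reason recovery fails.
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recovers from the checkpoint if one exists, else calls the factory.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```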