Questions tagged [spark-streaming]

Spark Streaming is an extension of the core Apache Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. As of version 1.3.0, it supports exactly-once processing semantics, even in the face of failures.

5565 questions
2
votes
0 answers

Training and Prediction in Spark Streaming Machine Learning Model

I am having a hard time understanding how we can both update the machine learning model and use it to make predictions in one Spark Streaming job. This code is from Spark's StreamingLinearRegressionExample class: val trainingData =…
Drakan
  • 31
  • 4
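The pattern this question is after comes straight from Spark's StreamingLinearRegressionExample: trainOn keeps updating the model as training batches arrive, while predictOnValues scores a second stream with the latest weights. A minimal Scala sketch, assuming LabeledPoint-formatted text files land in two placeholder directories:

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object TrainAndPredict {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("TrainAndPredict"), Seconds(1))

        // both streams carry LabeledPoint text such as "(1.0,[2.0,3.0])"
        val trainingData = ssc.textFileStream("/data/train").map(LabeledPoint.parse).cache()
        val testData = ssc.textFileStream("/data/test").map(LabeledPoint.parse)

        val model = new StreamingLinearRegressionWithSGD()
          .setInitialWeights(Vectors.zeros(2)) // 2 = number of features

        model.trainOn(trainingData) // weights updated on every training batch
        model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }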
2
votes
1 answer

Spark Streaming HiveContext NullPointerException

I'm writing a Spark Streaming application using Spark 1.6.0 on a CDH 5.8.3 cluster. The application is very simple: it reads from Kafka, applies some transformations to the DStream/RDDs, and then outputs them to a Hive table. I have also tried to put…
mgaido
  • 2,987
  • 3
  • 17
  • 39
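A frequent cause of a NullPointerException in this setup is a HiveContext created on the driver and then captured by a closure running on an executor. The Spark 1.6 streaming guide's remedy is a lazily instantiated singleton built from the RDD's own SparkContext inside foreachRDD; a sketch, where stream is assumed to be a DStream[String] and the table name is made up:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    // one HiveContext per JVM, created lazily on first use
    object HiveContextSingleton {
      @transient private var instance: HiveContext = _
      def getInstance(sc: SparkContext): HiveContext = synchronized {
        if (instance == null) instance = new HiveContext(sc)
        instance
      }
    }

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val hiveContext = HiveContextSingleton.getInstance(rdd.sparkContext)
        import hiveContext.implicits._
        rdd.toDF("value").write.mode("append").saveAsTable("mydb.mytable")
      }
    }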
2
votes
0 answers

How to convert an RDD of strings (XML format) to a DataFrame in Spark Java?

A good solution is available at the link below if the XML data is in a file: https://github.com/databricks/spark-xml. The code below converts XML to a Dataset by loading a physical file: Dataset df = sqlContext.read().format("com.databricks.spark.xml") …
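For XML that is already in an RDD[String] rather than a file, spark-xml also ships an XmlReader entry point. A hedged Scala sketch (the Java call sequence is analogous); check it against the spark-xml release you use, since the builder methods have shifted between versions:

    import com.databricks.spark.xml.XmlReader

    // parse an RDD of XML strings directly, with no file on disk
    val xmlStrings = sc.parallelize(Seq("<book><title>Spark</title></book>"))
    val df = new XmlReader()
      .withRowTag("book")
      .xmlRdd(sqlContext, xmlStrings)
    df.printSchema()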
2
votes
1 answer

UTF-8 encoding error while connecting a Flume Twitter stream to Spark in Python

I am having trouble passing the Twitter data collected by the Flume agent to a Spark stream. I can download the tweets independently using only Flume, but I am getting the following error. I suspect it is an issue with the default…
smm
  • 838
  • 1
  • 9
  • 31
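The usual shape of the fix is to decode the Flume event body explicitly as UTF-8 instead of relying on the platform default charset. The question is about Python, but the idea reads most compactly in Scala; ssc and the host/port are placeholders:

    import java.nio.charset.StandardCharsets
    import org.apache.spark.streaming.flume.FlumeUtils

    val flumeStream = FlumeUtils.createStream(ssc, "localhost", 9999)

    // decode each event body explicitly as UTF-8 rather than with the
    // JVM's default charset
    val tweets = flumeStream.map { sparkFlumeEvent =>
      new String(sparkFlumeEvent.event.getBody.array(), StandardCharsets.UTF_8)
    }
    tweets.print()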
2
votes
2 answers

What can cause my Spark Streaming checkpoint to be incomplete?

I am playing around with the Spark Streaming API, specifically testing the checkpointing feature. However, I am finding that in certain circumstances the checkpoint being returned is incomplete. The following code is run in local[2] mode…
Joe C
  • 15,324
  • 8
  • 38
  • 50
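One thing worth ruling out with incomplete-looking checkpoints: the whole DStream graph has to be defined inside the factory function handed to StreamingContext.getOrCreate, otherwise the recovered context is missing operators. A sketch with a placeholder checkpoint directory:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "/tmp/checkpoint" // placeholder path

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("cp"), Seconds(5))
      ssc.checkpoint(checkpointDir)
      // the full DStream graph must be built here, before returning,
      // so that it can be restored from the checkpoint after a failure
      ssc.socketTextStream("localhost", 9999)
        .countByWindow(Seconds(30), Seconds(5))
        .print()
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()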
2
votes
0 answers

Spark Java CollectionAccumulator object adding issue

I am new to Spark. I am using a Spark CollectionAccumulator to collect a list of customer objects. For the same customer I can have more than one object, and all of them need to be added to the accumulator. What is happening is that if I have 3 objects with the same customer, all…
MKS
  • 129
  • 1
  • 12
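For reference, Spark 2.x's CollectionAccumulator appends every element it is given, duplicates included, so three objects for the same customer should all appear in the result. A sketch in Scala (the Java API is the same), with Customer as a made-up case class:

    import org.apache.spark.sql.SparkSession

    case class Customer(id: Int, item: String)

    val spark = SparkSession.builder().appName("acc-demo").getOrCreate()
    val acc = spark.sparkContext.collectionAccumulator[Customer]("customers")

    // three objects with the same customer id: all three are kept,
    // because CollectionAccumulator appends rather than deduplicates
    spark.sparkContext
      .parallelize(Seq(Customer(1, "a"), Customer(1, "b"), Customer(1, "c")))
      .foreach(c => acc.add(c))

    println(acc.value) // java.util.List with 3 elements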
2
votes
2 answers

Spark Streaming with YARN: executors not fully utilized

I am running Spark Streaming on YARN with: spark-submit --master yarn --deploy-mode cluster --num-executors 2 --executor-memory 8g --driver-memory 2g --executor-cores 8 .. I am consuming Kafka through the DirectStream approach (no receiver). I have…
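With the direct approach, each batch's RDD gets exactly one partition per Kafka partition, so with few Kafka partitions most of those 2 x 8 cores sit idle unless the stream is repartitioned. A sketch against the Spark 1.6 / Kafka 0.8 API, with placeholder broker and topic names:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    // spread records over all 16 cores (2 executors * 8 cores) even if
    // the topic has fewer Kafka partitions; note this adds a shuffle
    val widened = stream.map(_._2).repartition(16)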
2
votes
0 answers

Most efficient way to write Spark Streaming data into an RDBMS

I am writing a Spark Streaming job that consumes data from Kafka and writes it to an RDBMS. I am currently stuck because I do not know the most efficient way to store this streaming data in the RDBMS. While searching, I found a few methods:…
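The pattern the Spark Streaming guide recommends for exactly this: open one connection per partition inside foreachPartition and batch the inserts, rather than connecting per record or collecting to the driver. A sketch with placeholder JDBC details, assuming a DStream[String]:

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one connection and one prepared statement per partition
        val conn = DriverManager.getConnection("jdbc:postgresql://db:5432/app", "user", "pass")
        val stmt = conn.prepareStatement("INSERT INTO events (payload) VALUES (?)")
        records.foreach { r =>
          stmt.setString(1, r)
          stmt.addBatch() // accumulate, then send as one round trip
        }
        stmt.executeBatch()
        conn.close()
      }
    }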
2
votes
1 answer

How can I catch the log output of pyspark foreachPartition?

In pyspark, when I use print() in the foreachRDD method, it works! def echo(data): print data .... lines = MQTTUtils.createStream(ssc, brokerUrl, topics) topic_rdd = lines.map(lambda x: get_topic_rdd(x)).filter(lambda x: x[0]!=…
wu alex
  • 21
  • 2
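Why print appears to work in one place and not the other: code passed to foreachRDD runs on the driver, while the function given to foreachPartition runs on the executors, so its output lands in the executor (worker or YARN container) logs rather than on the driver console. The executor-side logging idea, sketched here in Scala:

    import org.apache.log4j.Logger

    rdd.foreachPartition { records =>
      // created on the executor: these messages show up in that
      // executor's stderr/stdout, not in the driver's output
      val log = Logger.getLogger("partition-logger")
      records.foreach(r => log.info(s"processed: $r"))
    }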
2
votes
1 answer

Inappropriate output while creating a DataFrame

I'm trying to stream the data from a Kafka topic using a Scala application. I'm able to get the data from the topic, but how do I create a DataFrame out of it? Here is the data (in (String, String) format): { "action": "AppEvent", "tenantid": 298, …
jack AKA karthik
  • 885
  • 3
  • 15
  • 30
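Since each Kafka value here is a JSON string, one common route is to hand the batch RDD to sqlContext.read.json inside foreachRDD and let Spark infer the schema. A sketch, assuming stream is the (String, String) DStream from Kafka:

    // _._2 is the JSON value of each Kafka record
    stream.map(_._2).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val df = sqlContext.read.json(rdd) // schema inferred per batch
        df.show()
      }
    }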
2
votes
1 answer

Spark Streaming Kafka direct consumer consumption speed drop

The Kafka direct consumer started to limit reads to 450 events (5 * 90 partitions) per batch (5 seconds); it had been running fine for 1 or 2 days before that (about 5,000 to 40,000 events per batch). I'm using a Spark standalone cluster (Spark and…
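450 events per 5-second batch across 90 partitions works out to exactly 1 record per second per partition, which looks like a rate limit kicking in rather than a consumption problem. Two settings worth checking, shown as a hedged sketch:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // backpressure adapts the ingestion rate to processing speed and
      // can clamp reads hard after a slow stretch
      .set("spark.streaming.backpressure.enabled", "true")
      // per-partition ceiling for the direct stream; a value of 1 would
      // reproduce exactly 5 * 90 = 450 events per 5-second batch
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")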
2
votes
1 answer

Spark Streaming joins with multiple history tables

Spark version: 1.5.2. We are trying to implement streaming for the first time and to do CDC on incoming streams, storing the results in HDFS. What is working: we started the POC with CDC on 1 table with input file streams. The base (history)…
K. Sam
  • 21
  • 2
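For Spark 1.5-era CDC, the usual shape is to turn each micro-batch into a DataFrame and join it against history tables read from HDFS. A hedged sketch with made-up paths and keys:

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val changes = sqlContext.read.json(rdd)
        // history table from HDFS; cache it if it is reused every batch
        val history = sqlContext.read.parquet("/warehouse/history/table1")
        changes.join(history, changes("key") === history("key"), "left_outer")
          .write.mode("append").parquet("/warehouse/cdc/output")
      }
    }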
2
votes
1 answer

Is there a bug when using RDD.cartesian with Spark Streaming?

My code: ks1 = KafkaUtils.createStream(ssc, zkQuorum='localhost:2181', groupId='G1', topics={'test': 2}) ks2 = KafkaUtils.createStream(ssc, zkQuorum='localhost:2181', groupId='G2', topics={'test': 2}) d1 = ks1.map(lambda x: x[1]).flatMap(lambda x:…
Zhang Tong
  • 4,569
  • 3
  • 19
  • 38
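cartesian is an RDD method, not a DStream one, so the supported route is transformWith, which exposes each batch's underlying RDDs from both streams. The question is pyspark; the same idea in Scala, with d1 and d2 assumed to be the two DStream[String]s:

    import org.apache.spark.rdd.RDD

    val pairs = d1.transformWith(d2, (rdd1: RDD[String], rdd2: RDD[String]) =>
      rdd1.cartesian(rdd2))
    pairs.print()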
2
votes
2 answers

How to read concurrently from each Kafka partition in Spark Streaming DirectAPI

If I am correct, by default Spark Streaming 1.6.1 uses a single thread to read data from each Kafka partition. Let's assume my Kafka topic has 50 partitions; does that mean messages in all 50 partitions will be read sequentially, or maybe in round robin…
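To be clear about the model: the direct stream maps Kafka partitions one-to-one onto RDD partitions, and partitions of a single RDD are processed as concurrent tasks, up to the number of available executor cores, not sequentially. A quick way to observe it (Spark 1.6 / Kafka 0.8 API, placeholder names):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("topic-with-50-partitions"))

    // prints 50: one RDD partition per Kafka partition, each read by its
    // own task, scheduled in parallel across the executor cores
    stream.foreachRDD(rdd => println(s"partitions in this batch: ${rdd.getNumPartitions}"))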
2
votes
1 answer

How to extract each JSONObject from a JSONArray and save to Cassandra in Spark Streaming

I'm trying to consume Kafka streaming data, which is a JSONArray, in Spark Streaming; each JSONArray contains several JSONObjects. I want to save each JSONObject into a DataFrame, and save it to a Cassandra table after mapping it with the other table. I've tried to…
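Spark's JSON reader turns a top-level JSON array into one row per element, which flattens exactly this shape; the result can then be appended to Cassandra through the spark-cassandra-connector DataFrame writer. A hedged sketch with a made-up keyspace and table, assuming stream is the (String, String) DStream from Kafka:

    stream.map(_._2).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // a top-level JSONArray yields one row per contained JSONObject
        val df = sqlContext.read.json(rdd)
        df.write
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "mykeyspace", "table" -> "events"))
          .mode("append")
          .save()
      }
    }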