Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Python, R, Scala, and Java. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older DStream-based Spark Streaming from Spark 1.x.

2360 questions
0 votes, 0 answers

Spark repartition based on size

Our Spark application consumes from Kafka and writes to Iceberg tables. Each batch of raw data is processed and split according to the target tables without any shuffle, but the size of the raw data for each table is not evenly distributed. What is the best…
0 votes, 1 answer

How does YARN decide to run tasks on the node manager nodes with the most available cores?

We have 10 node manager nodes co-hosted with data nodes. The available vcores on the nodes are as follows (vcores used / vcores available): node manager 1: 56 / 6, node manager 2: 35 / 1, node manager 3 …
Judy • 1,595 • 6 • 19 • 41
0 votes, 0 answers

Does Spark Structured Streaming support the concept of a "tombstone", a.k.a. deletion?

I have been developing with Kafka Streams for several years now. Recently, I got into a project that relies on Spark Structured Streaming. Going through the documentation, to my surprise I could not find something similar to Kafka Streams when it…
MaatDeamon • 9,532 • 9 • 60 • 127
0 votes, 0 answers

Schema Evolution from Kafka Source

Hi, I have a Spark streaming process that reads data from a Kafka topic into Azure DL. This is how I implement the MERGE capability into the Delta table. In addition, on the same topic, I have another streaming process that simply writes data to DL. In…
0 votes, 1 answer

XML Kafka message to DataFrame using PySpark

I have an XML message from a Kafka topic and I'm looking to convert the incoming message to a DataFrame and use it further. I have achieved the same for JSON messages but not with XML, if anyone could help on the same. Code with JSON (working…
0 votes, 0 answers

Spark Kafka: understanding offset management with enable.auto.commit

According to the Kafka documentation, offsets in Kafka can be managed using enable.auto.commit and auto.commit.interval.ms. I have difficulties understanding the concept. For example, I have a Kafka job that shall batch load every day and shall only load…
0 votes, 0 answers

How to restart specific structured streaming queries with a Databricks workflow job

I have a notebook that hosts nearly 100 streaming queries, and 2 of them die occasionally. However, restarting the whole job may take 4 hours. Is it possible to restore the failed streaming queries without restarting the whole job?
Hin Solo • 11 • 2
0 votes, 0 answers

How to read a large CSV file located on another computer with PySpark?

I want to read a large file daily with PySpark. This large file is located on another computer, and I only have SSH access to that computer. How can I read this file with PySpark? And, as I say, I want to read daily a new file that is saved on this remote…
Tavakoli • 1,303 • 3 • 18 • 36
0 votes, 0 answers

Spark join on two small Kafka topics takes > 15 min per job

I'm using Spark in a local .master(local) environment with two Kafka topics holding one message each, performing a join on them. Each job run takes more than 15 minutes. override def execute(triggerHandler: () => Boolean): Unit = { while (true) { …
0 votes, 2 answers

Copy (in Delta format) of an append-only incremental table that is in JDBC (SQL)

My ultimate goal is to have a copy (in Delta format) of an append-only incremental table that is in JDBC (SQL). I have a batch process reading from the incremental append-only JDBC (SQL) table with spark.read (since .readStream is not supported for…
0 votes, 0 answers

Spark Structured Streaming + Apache Iceberg: how can appends be idempotent?

I'm using Spark Structured Streaming to append to a partitioned Iceberg table. I need to use foreachBatch or foreach as I'm using a custom Iceberg catalog implementation (the one from Google BigLake). The Spark docs say foreachBatch is at-least-once, meaning it…
nir • 3,743 • 4 • 39 • 63
0 votes, 1 answer

Spark structured streaming - Kinesis stream

Does Spark support Structured Streaming with a Kinesis stream as the data source? It appears the Databricks version does: https://docs.databricks.com/structured-streaming/kinesis-best-practices.html. However, does Spark outside of Databricks support…
0 votes, 0 answers

How to improve Spark consumption of messages from Kafka

I'm running a fairly big Spark cluster with 96 cores, polling messages from Kafka. There are around 200 total partitions from different topics, which are being processed in a continuous streaming job. The parameters I'm using against Kafka…
Sandie • 869 • 2 • 12 • 22
0 votes, 1 answer

Spark 2.4.0 Structured Streaming Kafka Consumer Checkpointing

I am using Spark 2.4.0 Structured Streaming (batch mode, i.e. spark.read vs .readStream) to consume a Kafka topic. I am checkpointing read offsets and using the .option("startingOffsets", ...) to dictate where to continue reading on the next job run. In…
bzak • 483 • 4 • 14
0 votes, 2 answers

Count words of a text, without special characters

I need a little bit of aid. I want to see each of the elements of the RDD (rddseparar). The idea is to count the words of a text, eliminating the special characters, and this is one of the steps to get it: import re fileName =…