Questions tagged [spark-structured-streaming]

Spark Structured Streaming allows processing live data streams using DataFrame and Dataset APIs.

Spark Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing with the Dataset/DataFrame APIs, available in Python, R, Scala, and Java. Structured Streaming was introduced in Spark 2.x and is not to be confused with the older DStream-based Spark Streaming from Spark 1.x.

2360 questions
0 votes, 0 answers

Spark repartition based on size

Our Spark application consumes from Kafka and writes to Iceberg tables. Each batch of raw data is processed and split according to the target tables without any shuffle, but the size of the raw data for each table is not evenly distributed. What is the best…
0 votes, 1 answer

How does YARN decide to run tasks on the node manager nodes with the most available cores?

We have 10 node manager nodes co-hosted with data nodes. The available vcores on the nodes are as follows (vcores used / vcores available): node manager 1: 56 / 6, node manager 2: 35 / 1, node manager 3 …
Judy • 1,595 • 6 • 19 • 41
0 votes, 0 answers

Does Spark Structured Streaming support the concept of a "tombstone", a.k.a. deletion?

I have been developing with Kafka Streams for several years now. Recently, I got into a project that relies on Spark Structured Streaming. Going through the documentation, to my surprise I could not find something similar to Kafka Streams when it…
MaatDeamon • 9,532 • 9 • 60 • 127
0 votes, 0 answers

Schema Evolution from Kafka Source

Hi, I have a Spark streaming process that reads data from a Kafka topic into Azure DL. This is how I implement the MERGE capability into the Delta table. In addition, on the same topic, I have another streaming process that simply writes data to DL. In…
0 votes, 1 answer

XML Kafka message to DataFrame using PySpark

I have an XML message from a Kafka topic and I'm looking to convert the incoming message to a DataFrame and use it further. I have achieved the same for JSON messages but not with XML, if anyone could help on the same. Code with JSON (working…
0 votes, 0 answers

Spark Kafka: understanding offset management with enable.auto.commit

According to the Kafka documentation, offsets in Kafka can be managed using enable.auto.commit and auto.commit.interval.ms. I have difficulties understanding the concept. For example, I have a Kafka job that shall batch load every day and shall only load…
0 votes, 0 answers

How to restart specific structured streaming queries with a Databricks workflow job

I have a notebook that hosts nearly 100 streaming queries, and 2 of them die occasionally. However, restarting the whole job may take 4 hours. Is it possible to restore the failed streaming queries without restarting the whole job?
Hin Solo • 11 • 2
0 votes, 0 answers

How to read a large CSV file located on another computer with PySpark?

I want to read a large file daily with PySpark. This large file is located on another computer, and I only have SSH access to that computer. How can I read this file with PySpark? And, as I say, I want to read daily a new file that is saved on this remote…
Tavakoli • 1,303 • 3 • 18 • 36
0 votes, 0 answers

Spark join on two small Kafka topics takes > 15 min per job

I'm using Spark in a local .master(local) environment with two Kafka topics holding one message each, performing a join on them. Each job run takes more than 15 minutes. override def execute(triggerHandler: () => Boolean): Unit = { while (true) { …
0 votes, 2 answers

Copy (in Delta format) of an append-only incremental table that is in JDBC (SQL)

My ultimate goal is to have a copy (in Delta format) of an append-only incremental table that is in JDBC (SQL). I have a batch process reading from the incremental append-only JDBC (SQL) table with spark.read (since .readStream is not supported for…
0 votes, 0 answers

Spark Structured Streaming + Apache Iceberg: how can appends be idempotent?

I'm using Spark Structured Streaming to append to a partitioned Iceberg table. I need to use foreachBatch or foreach as I'm using a custom Iceberg catalog implementation (the one from Google BigLake). The Spark docs say foreachBatch is at-least-once, meaning it…
nir • 3,743 • 4 • 39 • 63
0 votes, 1 answer

Spark structured streaming - Kinesis stream

Does Spark support Structured Streaming with a Kinesis stream as the data source? It appears the Databricks version does: https://docs.databricks.com/structured-streaming/kinesis-best-practices.html. However, does Spark outside of Databricks support…
0 votes, 0 answers

How to improve Spark consumption of messages from Kafka

I'm running a fairly big Spark cluster with 96 cores, polling messages from Kafka. There are around 200 total partitions from different topics, which are being processed in a continuous streaming job. The parameters I'm using against Kafka…
Sandie • 869 • 2 • 12 • 22
0 votes, 1 answer

Spark 2.4.0 Structured Streaming Kafka Consumer Checkpointing

I am using Spark 2.4.0 Structured Streaming (batch mode, i.e. spark.read vs .readStream) to consume a Kafka topic. I am checkpointing read offsets and using the .option("startingOffsets", ...) to dictate where to continue reading on the next job run. In…
bzak • 483 • 4 • 14
0 votes, 2 answers

Count words of a text, without special characters

I need a little bit of aid. I want to see each of the elements of the RDD (rddseparar). The idea is to count the words of a text, eliminating the special characters, and this is one of the steps to get it: import re fileName =…