Is it possible for a PySpark job to write to a Delta table and read from that same table in the same code? Here is what I'm trying to do.
Problem statement: I'm having trouble printing the data to the console to see what is flowing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import *
spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
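# read the Kafka topic as a stream, tag each record with an ingestion timestamp,
# and decode the binary message value to a string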
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "demo.topic") \
    .option("startingOffsets", "earliest") \
    .load() \
    .withColumn("ingested_timestamp", unix_timestamp()) \
    .withColumn("value_str", col("value").cast(StringType())) \
    .select("ingested_timestamp", "value_str")
# code to write to the Delta table called events
stream = kafka_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "./data/tmp/delta/events/_checkpoints/") \
    .toTable("events")
# code to read the same delta table
read_df = spark.read.format("delta").table("events")
read_df.show(5)
stream.awaitTermination()
The code runs without an error using the following command.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.0 kafka_and_create_delta_table.py
I'm trying to see the data that I push to Kafka show up in the Delta table, to make sure the data is flowing and the underlying components work. However, I only see an empty table, even after sending traffic to my topic:
Found no committed offset for the partition demo.topic-0
+------------------+---------+
|ingested_timestamp|value_str|
+------------------+---------+
+------------------+---------+
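For the console-printing part specifically, I also thought about attaching a second, console-only sink to the same kafka_df next to the Delta write. This is just a sketch of the idea (the console_stream name is mine, and I have not verified it changes anything about the Delta behaviour):

# sketch: an extra console sink on the same streaming DataFrame, purely for debugging
console_stream = kafka_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .option("truncate", "false") \
    .start()

With two queries running, I assume I would block on spark.streams.awaitAnyTermination() instead of awaiting a single query.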
Any kind of assistance would be helpful.
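In the meantime, one variation I was considering is to read the Delta table as a stream instead of doing a one-off batch read, in case the batch read simply runs before the stream has committed anything. A rough sketch of what I mean (not verified, and delta_stream is just my own name for it):

# sketch: continuously read the same Delta table and print newly committed rows to the console
delta_stream = spark.readStream \
    .format("delta") \
    .table("events") \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .start()
delta_stream.awaitTermination()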
Also, I tried running the write logic in one job and the read logic in another. Read job:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
read_df = spark.read.table("events")
read_df.show(5)
Then the read job complained:
pyspark.sql.utils.AnalysisException: Table or view not found: events; 'UnresolvedRelation [events], [], false
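If it matters, my fallback idea was to read the Delta data directly by filesystem path instead of by table name, since the second session might not see the first session's metastore entry. The path below is only my guess at where toTable("events") puts the managed table (the default spark-warehouse directory); I have not confirmed it:

# sketch: read the Delta files by path instead of via the metastore table name
# "./spark-warehouse/events" is an assumption about where the write job stores the table
read_df = spark.read.format("delta").load("./spark-warehouse/events")
read_df.show(5, truncate=False)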