Is it possible for a PySpark job to write to a Delta table and read from that same table in the same code? Here is what I'm trying to do.
Problem statement: I'm having trouble printing the data to the console to see what is flowing.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import *
spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
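# read the Kafka topic as a stream, tag each record with an ingestion timestamp,
# and decode the binary message value to a string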
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "demo.topic") \
    .option("startingOffsets", "earliest") \
    .load() \
    .withColumn("ingested_timestamp", unix_timestamp()) \
    .withColumn("value_str", col("value").cast(StringType())) \
    .select("ingested_timestamp", "value_str")
# code to write to the Delta table called events
stream = kafka_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "./data/tmp/delta/events/_checkpoints/") \
    .toTable("events")
# code to read the same delta table
read_df = spark.read.format("delta").table("events")
read_df.show(5)
stream.awaitTermination()
The code runs without an error using the following command.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,io.delta:delta-core_2.12:2.1.0 kafka_and_create_delta_table.py
I'm trying to see the data that I push to Kafka show up in the Delta table, to make sure the data is flowing and the underlying components work. However, I only see an empty table, even after sending traffic to my topic:
Found no committed offset for the partition demo.topic-0
+------------------+---------+
|ingested_timestamp|value_str|
+------------------+---------+
+------------------+---------+
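For the console-printing part specifically, I also thought about attaching a second, console-only sink to the same kafka_df next to the Delta write. This is just a sketch of the idea (the console_stream name is mine, and I have not verified it changes anything about the Delta behaviour):

# sketch: an extra console sink on the same streaming DataFrame, purely for debugging
console_stream = kafka_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .option("truncate", "false") \
    .start()

With two queries running, I assume I would block on spark.streams.awaitAnyTermination() instead of awaiting a single query.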
Any kind of assistance would be helpful.
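In the meantime, one variation I was considering is to read the Delta table as a stream instead of doing a one-off batch read, in case the batch read simply runs before the stream has committed anything. A rough sketch of what I mean (not verified, and delta_stream is just my own name for it):

# sketch: continuously read the same Delta table and print newly committed rows to the console
delta_stream = spark.readStream \
    .format("delta") \
    .table("events") \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .start()
delta_stream.awaitTermination()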
Also, I tried running the write logic in one job and the read logic in another. Read job:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("test") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
read_df = spark.read.table("events")
read_df.show(5)
Then the read job complained:
pyspark.sql.utils.AnalysisException: Table or view not found: events; 'UnresolvedRelation [events], [], false
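If it matters, my fallback idea was to read the Delta data directly by filesystem path instead of by table name, since the second session might not see the first session's metastore entry. The path below is only my guess at where toTable("events") puts the managed table (the default spark-warehouse directory); I have not confirmed it:

# sketch: read the Delta files by path instead of via the metastore table name
# "./spark-warehouse/events" is an assumption about where the write job stores the table
read_df = spark.read.format("delta").load("./spark-warehouse/events")
read_df.show(5, truncate=False)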