I am trying to read a specific file from a folder that contains multiple Delta files. Please refer to the attached screenshot.

The reason is that I want to read the Delta data based on the schema version: the folder mentioned above contains files with different schema structures.

Code snippet for writing a file:

df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/home/games/Documents/test_delta/")

Code for reading a Delta file:

import pyspark

from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

path_to_data = '/home/games/Documents/test_delta/_delta_log/00000000000000000001.json'
df = spark.read.format("delta").load(path_to_data)
df.show()

Error:

org.apache.spark.sql.delta.DeltaAnalysisException: /home/games/Documents/test_delta/_delta_log/ is not a Delta table.

Alex Ott
SherKhan
  • The Delta file extension is .delta, not .json. You are not reading Delta files; you are trying to read a .json file to create a Delta table, if my understanding is correct. First, you have to read all the .json files into a DataFrame, and when writing the DataFrame you have to specify the format as delta and use save() with an external location. If you use saveAsTable(), your table will be created in the Hive metastore. – Sandesh Oct 20 '22 at 13:56
  • @Sandesh: Thanks. Actually, I am able to read the Delta table from "/home/games/Documents/test_delta/", but the problem is that it returns only the latest schema, and I want to read a specific version of the Delta table. Any suggestion as to what I am doing wrong here? – SherKhan Oct 20 '22 at 15:29

1 Answer

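You should use the following, with path_to_data pointing at the table's root folder (/home/games/Documents/test_delta/ in your case) rather than at a file inside _delta_log: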

df = spark.read.format("delta").option("versionAsOf", 0).load(path_to_data)

You can specify other versions instead of 0, depending on how many times you have overwritten the data. You can also use timestamps. Please see the Delta quick-start for more info.
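
As a rough sketch of the two options just mentioned, the helper below reads a table either by version or by timestamp. The function name read_delta_version and its parameters are made up for illustration; only the "versionAsOf" and "timestampAsOf" reader options come from Delta itself. It assumes spark is a SparkSession already configured for Delta, as in the question.

```python
def read_delta_version(spark, table_path, version=None, timestamp=None):
    """Load a Delta table at a given version or timestamp (hypothetical helper).

    `table_path` must be the table's root folder, not a file in _delta_log.
    """
    if version is not None and timestamp is not None:
        raise ValueError("Pass either version or timestamp, not both")

    reader = spark.read.format("delta")
    if version is not None:
        # Time travel by commit number, e.g. 0 for the first write.
        reader = reader.option("versionAsOf", version)
    elif timestamp is not None:
        # Time travel by timestamp string, e.g. "2022-10-20 13:00:00".
        reader = reader.option("timestampAsOf", timestamp)
    # With neither option set, this loads the latest version.
    return reader.load(table_path)
```

For example, read_delta_version(spark, "/home/games/Documents/test_delta/", version=0).printSchema() should show the schema of the first write, provided that version is still available in the table's log.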

Also, the _delta_log folder contains the Delta transaction log in JSON format, not the actual data. The data is in the parent folder (test_delta in your case). The files starting with part-0000 are the ones that contain the actual data; these are .parquet files. There are no files with a .delta extension.
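
Since the log is plain JSON, one way to find out which version carries which schema is to scan it directly. The sketch below is not a Delta API: schemas_by_version is a hypothetical helper that assumes each commit file is newline-delimited JSON and that schema changes appear as a "metaData" action with a "schemaString" field. Versions without such an action keep the schema of the previous one, and older commits may be missing if the log has been cleaned up.

```python
import json
from pathlib import Path


def schemas_by_version(table_path):
    """Map commit version -> column names, for commits that set a schema."""
    log_dir = Path(table_path) / "_delta_log"
    result = {}
    for commit in sorted(log_dir.glob("*.json")):
        if not commit.stem.isdigit():
            continue  # skip anything that is not a numbered commit file
        version = int(commit.stem)
        with commit.open() as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                action = json.loads(line)  # one action per line
                meta = action.get("metaData")
                if meta and "schemaString" in meta:
                    # schemaString is itself a JSON-encoded struct type
                    schema = json.loads(meta["schemaString"])
                    result[version] = [f["name"] for f in schema.get("fields", [])]
    return result
```

Calling schemas_by_version("/home/games/Documents/test_delta/") returns a dict keyed by version number, which you can then pass as the versionAsOf value when reading.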

o_O