
I recently started learning about Spark, and I was studying Spark managed tables. As per the docs, "Spark manages both the data and the metadata." Assume that I have a CSV file in S3 and I read it into a DataFrame like below.

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("s3a://databricks-learning-s333/temp/flights.csv"))

Now I create a Spark managed table in Databricks as below:

spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")

spark.sql("""CREATE TABLE managed_us_delay_flights_tbl
    (date STRING, delay INT, distance INT, origin STRING, destination STRING)""")

df.write.mode("overwrite").saveAsTable("managed_us_delay_flights_tbl")

Now it is a Spark managed table, so Spark manages both the data and the metadata.

As per the docs, if we drop a managed table, Spark deletes both the metadata and the actual data.

Here are my questions:

  1. The code below drops the Spark managed table. Does that mean it will delete my original data in S3? What exactly does it mean that Spark deletes both the data and the metadata?

    spark.sql('DROP TABLE managed_us_delay_flights_tbl')
    
  2. I read here that when we create managed tables, Spark uses the Delta format. My original data is in CSV format in S3; does that mean Spark will change the CSV to Delta format, or will it duplicate the data and write it in Delta format somewhere else?

  3. If I create Spark managed tables, do they use the same underlying storage or some new location? Please explain in detail.

Ravi

2 Answers

  1. Your CSV file is your source. In the code above, Spark reads the CSV file and loads it into a DataFrame; in the next step, Spark writes that data into a Delta table, without touching your CSV source. When you drop the managed table, your source CSV is not affected.

  2. Spark will use the Delta format for the new table. Delta is a storage format built on top of Parquet data files plus a transaction log. That's the default in Databricks, but you can choose other formats (see the sketch after this list). It will not affect the source CSV, only your table's destination.

  3. Managed tables are created under the folder of your database ("learn_spark_db"). You can find this root folder by using:

     %sql DESCRIBE DATABASE learn_spark_db
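
To illustrate point 2: the format of the new table is controlled by the writer, not by the source file. A minimal sketch, assuming the DataFrame from the question (the second table name is just an illustrative variant):

    # Default in Databricks: the managed table is written as Delta
    df.write.saveAsTable("managed_us_delay_flights_tbl")

    # Explicitly choosing another format, e.g. Parquet, for a new table
    df.write.format("parquet").saveAsTable("managed_us_delay_flights_parquet")

In both cases the CSV in S3 is only read, never modified.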
    
Chen Hirsh

Q1. The code below drops the Spark managed table. Does that mean it will delete my original data in S3? What does it mean that Spark deletes both the data and the metadata?

Ans: It will not delete your original S3 data. Since you are creating a managed table, the table's data is stored in DBFS under the /user/hive/warehouse/learn_spark_db.db/ folder. After executing the DROP statement, the data is deleted from the /user/hive/warehouse/learn_spark_db.db/ directory, not from S3. If you provide a LOCATION during the creation of the table, it is treated as an unmanaged (external) table, and only the metadata is deleted when the table is dropped.
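
You can verify this yourself. A minimal sketch, assuming the paths from the question:

    # Drop the managed table: removes its metadata and its files under the warehouse
    spark.sql("DROP TABLE managed_us_delay_flights_tbl")

    # The table's folder is gone from the database directory in DBFS...
    dbutils.fs.ls("/user/hive/warehouse/learn_spark_db.db/")

    # ...but the source CSV in S3 is still there
    dbutils.fs.ls("s3a://databricks-learning-s333/temp/")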

Q2. I read here that when we create managed tables, Spark uses the Delta format. My original data is in CSV format in S3; does that mean Spark will change the CSV to Delta format, or will it duplicate the data and write it in Delta format somewhere else?

Ans: It will not change the original data in S3. What it will do is write the same data into DBFS under the /user/hive/warehouse/learn_spark_db.db/ location, in Delta format if you don't specify any other format. You can see the new data files using the Databricks utility:

dbutils.fs.ls("/user/hive/warehouse/learn_spark_db.db/")
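
You can also confirm the table's format and storage location from its metadata, using standard Spark SQL against the table from the question:

    # "Provider" shows the format (delta) and "Location" shows the warehouse path
    spark.sql("DESCRIBE EXTENDED managed_us_delay_flights_tbl").show(truncate=False)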

Q3. If I create Spark managed tables, do they use the same underlying storage or some new location? Please explain in detail.

Ans: Whenever you create a Databricks workspace, an underlying storage account is also created for storing data, typically known as the Databricks File System (DBFS).

DBFS (Databricks File System) is a distributed file system available on Databricks clusters. It is an abstraction layer over cloud storage (e.g. S3 or Azure Blob Storage), and it allows external storage buckets to be mounted as paths in the DBFS namespace.
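
For example, an S3 bucket can be mounted into the DBFS namespace roughly like this (the bucket name and mount point are placeholders, and the credential configuration is omitted):

    # Hypothetical mount: make the bucket visible under /mnt/learning in DBFS
    dbutils.fs.mount(
        source = "s3a://databricks-learning-s333",
        mount_point = "/mnt/learning"
    )

    # The bucket's contents now appear as ordinary DBFS paths
    dbutils.fs.ls("/mnt/learning/temp/")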

You can see it using the UI or the Databricks utility:

dbutils.fs.ls("/")

Now to your question: whenever you create a managed table, Spark stores both the metadata and the data in that underlying Databricks-managed storage account.

You can see this using:

dbutils.fs.ls("/user/hive/warehouse/")
Ravi