2

I believe I understand the basic difference between Managed and External tables in Spark SQL. Just for clarity, given below is how I would explain it.

A managed table is a Spark SQL table for which Spark manages both the data and the metadata. In the case of a managed table, Databricks stores the metadata and data in DBFS in your account. Since Spark SQL manages the tables, doing a DROP TABLE deletes both the metadata and data.

Another option is to let Spark SQL manage the metadata, while you control the data location. We refer to this as an unmanaged table. Spark SQL manages the relevant metadata, so when you perform DROP TABLE, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.

Now, over time, I have come across a few different caveats when it comes to creating External tables. I will post the relevant Spark SQL queries and what I understand of it below. I would like to know if my understanding is correct and I would also like to hear any additional insight.

Query 1

CREATE TABLE test_tbl 
USING CSV 
LOCATION '/mnt/csv_files'

This basic query will create a table using the data that is stored in the given LOCATION. No additional directories or data files will be created here and the data will continue to reside in '/mnt/csv_files'.

Any data that is inserted, updated or deleted from this table will be reflected in the data files in the path given?

Query 2

CREATE TABLE test_tbl(id STRING, value STRING) 
USING PARQUET 
OPTIONS (PATH '/mnt/test_tbl')

This query will create the table, but also create a directory as defined by the given path. Any data that is added to this table will result in the creation of data files within the path defined: '/mnt/test_tbl'.

Is the OPTIONS (PATH ..) clause any different from the LOCATION clause?

Query 3

CREATE TABLE test_tbl 
LOCATION '/mnt/test_tbl'
AS SELECT * FROM tmp

Similar to the previous query, this will create the table, but also create a directory as defined by the given path. The data, however, will come from the VIEW tmp, as defined by the AS query. Based on the data in this view, data files will be created within the new directory as well.

Any operations performed on this table will be reflected in the data files that were so created and the VIEW will remain untouched.

Are the explanations given above correct? Are there any other methods that can be used to create External tables?

Minura Punchihewa
  • 1,498
  • 1
  • 12
  • 35

1 Answers1

2

Query 1: Your understanding is correct.

Query 2: As far as I know, Location and Path do the same thing. Location usage is more of HIVE style and Options(Path) usage is more towards Scala/Python style. Other than this I don't think they are any different.

Query 3: Yes output of the View tmp will be stored in the location. Any operations done to this table will not affect the original view. Only thing I want to add is, in the third syntax if you miss the USING then by default it creates a DELTA table in newer DBRs.

You can check the table definitions using

%sql

describe formatted test_tbl 

scroll down and look for the value of "TYPE". It should be External

Ganesh Chandrasekaran
  • 1,578
  • 12
  • 17