I believe I understand the basic difference between Managed and External tables in Spark SQL. Just for clarity, given below is how I would explain it.
A managed table is a Spark SQL table for which Spark manages both the data and the metadata. In the case of a managed table, Databricks stores the metadata and data in DBFS in your account. Since Spark SQL manages the tables, doing a DROP TABLE deletes both the metadata and data.
Another option is to let Spark SQL manage the metadata, while you control the data location. We refer to this as an unmanaged table. Spark SQL manages the relevant metadata, so when you perform DROP TABLE, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.
Now, over time, I have come across a few different caveats when it comes to creating External tables. I will post the relevant Spark SQL queries and what I understand of it below. I would like to know if my understanding is correct and I would also like to hear any additional insight.
Query 1
CREATE TABLE test_tbl
USING CSV
LOCATION '/mnt/csv_files'
This basic query will create a table using the data that is stored in the given LOCATION
. No additional directories or data files will be created here and the data will continue to reside in '/mnt/csv_files'
.
Any data that is inserted, updated or deleted from this table will be reflected in the data files in the path given?
Query 2
CREATE TABLE test_tbl(id STRING, value STRING)
USING PARQUET
OPTIONS (PATH '/mnt/test_tbl')
This query will create the table, but also create a directory as defined by the given path. Any data that is added to this table will result in the creation of data files within the path defined: '/mnt/test_tbl'
.
Is the OPTIONS (PATH ..)
clause any different from the LOCATION
clause?
Query 3
CREATE TABLE test_tbl
LOCATION '/mnt/test_tbl'
AS SELECT * FROM tmp
Similar to the previous query, this will create the table, but also create a directory as defined by the given path. The data, however, will come from the VIEW tmp
, as defined by the AS
query. Based on the data in this view, data files will be created within the new directory as well.
Any operations performed on this table will be reflected in the data files that were so created and the VIEW will remain untouched.
Are the explanations given above correct? Are there any other methods that can be used to create External tables?