
Let's say you create a table in DBFS as follows.

%sql
DROP TABLE IF EXISTS silver_loan_stats;

-- Explicitly define our table, providing schema for schema enforcement.
CREATE TABLE silver_loan_stats (
    loan_status STRING, 
    int_rate FLOAT, 
    revol_util FLOAT, 
    issue_d STRING, 
    earliest_cr_line STRING, 
    emp_length FLOAT, 
    verification_status STRING, 
    total_pymnt DOUBLE, 
    loan_amnt FLOAT, 
    grade STRING, 
    annual_inc FLOAT, 
    dti FLOAT, 
    addr_state STRING, 
    term STRING, 
    home_ownership STRING, 
    purpose STRING, 
    application_type STRING, 
    delinq_2yrs FLOAT, 
    total_acc FLOAT,
    bad_loan STRING,
    issue_year DOUBLE,
    earliest_year DOUBLE,
    credit_length_in_years DOUBLE)
USING DELTA
LOCATION "/tmp/${username}/silver_loan_stats";

Later, you save data (a DataFrame named 'loan_stats') to this same LOCATION.

# Configure destination path
DELTALAKE_SILVER_PATH = f"/tmp/{username}/silver_loan_stats"

# Write out the table
loan_stats.write.format('delta').mode('overwrite').save(DELTALAKE_SILVER_PATH)

# Read the table
loan_stats = spark.read.format("delta").load(DELTALAKE_SILVER_PATH)

display(loan_stats)

My questions are:

  1. Are the table and the source data linked? So e.g. removing or joining data on the table updates it on the source as well, and removing or joining data on the source updates it in the table as well?
  2. Does the above also hold when you create a view instead of a table ('createOrReplaceTempView' instead of CREATE TABLE)?

I am trying to see the point of using Spark SQL when Spark DataFrames already offer a lot of functionality. It makes sense to me if the two are effectively the same data, but if CREATE TABLE (or createOrReplaceTempView) creates a duplicate, then I find it difficult to understand why you would put so much effort (and compute resources) into doing so.

beyondtdr

1 Answer

  1. The table and source data are linked in that the metastore contains the table information (silver_loan_stats) and that table points to the location as defined in DELTALAKE_SILVER_PATH.
  2. The CREATE TABLE is really a CREATE EXTERNAL TABLE, as the table data and its metadata are stored at the DELTALAKE_SILVER_PATH - specifically in the transaction log at `DELTALAKE_SILVER_PATH/_delta_log`.

To clarify, you are not duplicating the data when you do this - it's just an intermixing of SQL vs. API. HTH!
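To make the "no duplication" point concrete, here is a minimal sketch (it assumes a running SparkSession with Delta Lake configured, plus the `username` variable and the table from the question; the `bad_loan = 'Yes'` value is a hypothetical filter for illustration). Reading by table name and reading by path resolve to the same files under the LOCATION, so a write through either route is visible to the other:

```python
# Sketch only: requires a live SparkSession with Delta Lake configured.
DELTALAKE_SILVER_PATH = f"/tmp/{username}/silver_loan_stats"

# Read via the metastore table name (Spark SQL)...
df_by_name = spark.table("silver_loan_stats")

# ...and via the path (DataFrame API). Same _delta_log, same data files.
df_by_path = spark.read.format("delta").load(DELTALAKE_SILVER_PATH)

# A change through the SQL route lands in the same transaction log,
# so a path-based read sees it too (modulo caching; see REFRESH TABLE).
spark.sql("DELETE FROM silver_loan_stats WHERE bad_loan = 'Yes'")  # hypothetical predicate
remaining = spark.read.format("delta").load(DELTALAKE_SILVER_PATH)

# A temp view is the same idea one level up: it just registers a name
# for an existing DataFrame in this session's catalog - no copy is made.
remaining.createOrReplaceTempView("loan_stats_view")
```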

Denny Lee
  • Thank you Denny, the 'external table' & 'metadata' keywords were helpful for me to learn more. For my further understanding: why do you need to REFRESH TABLE if data on source location is updated? If you try to update an old/cached table without having refreshed, will it throw an error because it knows it's not working with the right data? E.g. updating some data in a table might work on the old/cached table but the latest data on the source location no longer has this data? So you might think you have updated the table successfully but behind the scenes the data is much different already..? – beyondtdr May 03 '21 at 10:47
  • 1
    Cool stuff - yes, the REFRESH TABLE is to invalidate any cache entries. In the case of updating a table, the good news is that because Delta is MVCC, updates/delete/merges are actually new files so its most likely going to invalidate any cache anyways because the table will need to read from new files anyways. HTH! – Denny Lee May 04 '21 at 01:47