Let's say you create a table in DBFS as follows.
%sql
DROP TABLE IF EXISTS silver_loan_stats;
-- Explicitly define our table, providing schema for schema enforcement.
CREATE TABLE silver_loan_stats (
loan_status STRING,
int_rate FLOAT,
revol_util FLOAT,
issue_d STRING,
earliest_cr_line STRING,
emp_length FLOAT,
verification_status STRING,
total_pymnt DOUBLE,
loan_amnt FLOAT,
grade STRING,
annual_inc FLOAT,
dti FLOAT,
addr_state STRING,
term STRING,
home_ownership STRING,
purpose STRING,
application_type STRING,
delinq_2yrs FLOAT,
total_acc FLOAT,
bad_loan STRING,
issue_year DOUBLE,
earliest_year DOUBLE,
credit_length_in_years DOUBLE)
USING DELTA
LOCATION "/tmp/${username}/silver_loan_stats";
Later, you save data (a DataFrame named 'loan_stats') to this source LOCATION.
# Configure destination path
DELTALAKE_SILVER_PATH = f"/tmp/{username}/silver_loan_stats"
# Write out the table
loan_stats.write.format('delta').mode('overwrite').save(DELTALAKE_SILVER_PATH)
# Read the table
loan_stats = spark.read.format("delta").load(DELTALAKE_SILVER_PATH)
display(loan_stats)
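If I understand the setup correctly, the registered table should now point at the same files as the path, so reading by table name should return the same rows as reading the path directly. This is the kind of check I would run (a sketch, assuming the table name resolves in the current database):

# Read via the registered table name and via the path -- should these match?
table_df = spark.table("silver_loan_stats")
path_df = spark.read.format("delta").load(DELTALAKE_SILVER_PATH)
print(table_df.count(), path_df.count())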
My questions are:
- Are the table and the data at the source LOCATION linked? For example, if I remove or join data through the table, is the change reflected at the source, and if I remove or join data at the source, is it reflected in the table? (See the sketch after this list for the kind of operation I mean.)
- Does the same hold when you create a view instead of a table ('createOrReplaceTempView' instead of CREATE TABLE)?
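To make the first question concrete, here is the kind of operation I have in mind (illustrative only; the 'bad_loan' filter value is an assumption about the data):

# "Removing data on the table": delete rows through Spark SQL on the registered table
spark.sql("DELETE FROM silver_loan_stats WHERE bad_loan = 'true'")

# "Removing data on the source": overwrite the path with a filtered DataFrame
filtered = spark.read.format("delta").load(DELTALAKE_SILVER_PATH).filter("bad_loan != 'true'")
filtered.write.format("delta").mode("overwrite").save(DELTALAKE_SILVER_PATH)

# Would each of these changes show up when reading the data the other way?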
I am trying to see the point of using Spark SQL when Spark DataFrames already offer a lot of functionality. It makes sense to me if the two are effectively the same data, but if CREATE TABLE (or createOrReplaceTempView) creates a duplicate, then I find it difficult to understand why you would put so much effort (and compute resources) into doing so.
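For instance, these two seem to express the same computation on what I assume is the same data, which is why I do not see what registering the table buys me (column names are taken from the schema above):

# DataFrame API
loan_stats.groupBy("grade").avg("int_rate")

# Spark SQL on the registered table
spark.sql("SELECT grade, AVG(int_rate) FROM silver_loan_stats GROUP BY grade")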