
With the Databricks Lakehouse platform, it is possible to create 'tables', or to be more specific, Delta tables, using a statement such as the following:

DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
  path "/mnt/training/dataframes/people-10m.parquet",
  header "true"
);

What I would like to know is, what exactly happens behind the scenes when you create one of these tables? What exactly is a table in this context? Because the data is actually contained in files in the data lake (the data storage location) that Delta Lake is running on top of... right? Are tables some kind of abstraction that allows us to access the data stored in these files using something like SQL?

What does the USING parquet portion of this statement do? Are Parquet tables different from CSV tables in some way, or does this just depend on the format of the source data?

Any links to material that explains this idea would be appreciated. I want to understand this in depth from a technical point of view.

Minura Punchihewa

1 Answer


There are a few aspects here. Your table definition is not Delta Lake; it's Spark SQL (or Hive) syntax to define a table. It's just metadata that allows users to use the table easily without knowing where it's located, what data format it uses, etc. You can read more about databases & tables in the Databricks documentation.
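
As a rough illustration (assuming the People10M table from your statement has been created), you can ask the metastore what it actually stored; it returns only a definition (schema, format, location), not the data itself:

-- Show the definition the metastore keeps for the table;
-- no data was copied anywhere when the table was created.
SHOW CREATE TABLE People10M;

-- List the tables registered in the current database
SHOW TABLES;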

The actual format for data storage is specified by the USING directive. In your case it's Parquet, so when people or code read or write data, the underlying engine will first read the table metadata, figure out the location of the data and the file format, and then use the corresponding reader or writer.
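
For example (a sketch against the table from the question), DESCRIBE TABLE EXTENDED shows exactly what the engine resolves: the Provider row reports the file format (parquet) and the Location row reports the path given in the OPTIONS clause:

-- Inspect the metadata the engine reads before touching any files;
-- look for the "Provider" (file format) and "Location" (path) rows in the output.
DESCRIBE TABLE EXTENDED People10M;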

Delta is another file format (really a storage layer) that is built on top of Parquet as the data format, but adds additional capabilities such as ACID transactions, time travel, etc. (see the docs). If you want to use Delta instead of Parquet, you either need to use CONVERT TO DELTA to convert existing Parquet data into Delta, or specify USING delta when creating a completely new table.
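
A minimal sketch of both routes, using the path from your question (the table name People10M_delta is just a made-up example):

-- Option 1: convert the existing Parquet files in place into a Delta table
CONVERT TO DELTA parquet.`/mnt/training/dataframes/people-10m.parquet`;

-- Option 2: create a completely new table stored in Delta format,
-- copying the data from the existing Parquet table
CREATE TABLE People10M_delta
USING delta
AS SELECT * FROM People10M;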

Alex Ott
  • OK, the URL that you have linked also mentions a Hive metastore. I knew that Hive was a technology that allowed us to run SQL queries on our distributed file system, but I was not aware that it worked with Spark as well. So, if I've got this right, when we create a table using Spark SQL, there is no physical table being created? We just store some metadata such as the schema of the table in a metastore (Hive) and through this, there is a kind of logical table that we can work with? The data, however, still remains in the files of our data store (or lake)? – Minura Punchihewa Apr 29 '22 at 19:32
  • Data is always in files. A table like your definition is just a reference to them that contains the schema, location, and other information – Alex Ott Apr 29 '22 at 20:11
  • OK, another question I have is, if we do not use the `USING delta` directive as you mentioned, will we not enjoy the benefits of delta lake and delta engine? – Minura Punchihewa Apr 30 '22 at 04:35
  • Yes, if your data is not in Delta format, you don’t get the benefits of the Delta engine. – Alex Ott Apr 30 '22 at 05:06
  • Just a follow up though, what about the support for ACID transactions that is offered by delta lake? Does this, again, apply only to delta tables? – Minura Punchihewa Apr 30 '22 at 18:25
  • Yes, they are supported only for Delta tables. We often talk about the Delta Lakehouse; it's more a concept, but it relies on the transactions & other things that aren't available for formats like Parquet & ORC – Alex Ott Apr 30 '22 at 18:37
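
To make that concrete, here is a rough sketch (assuming a Delta table such as the hypothetical People10M_delta above; the column names are guesses based on the people-10m sample dataset). Statements like UPDATE, DELETE, and MERGE INTO, as well as time travel, work only on Delta tables and will fail on a plain Parquet table:

-- Transactional update, recorded atomically in the Delta log
UPDATE People10M_delta SET salary = salary + 100 WHERE id = 1;

-- Time travel: query the table as it was at an earlier version
SELECT * FROM People10M_delta VERSION AS OF 0;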