Databricks - converting Spark dataframe to table: is it the same data source?

Question

Creating a Spark table from a source dataframe requires quite a bit of compute, doesn't it? Or are a dataframe and a table both pointers to the same data (i.e. when creating a table you are not duplicating the data)?

I guess what I'm trying to figure out is whether you can cheaply switch back and forth between a Spark dataframe and a table, or whether doing so is (very) computationally expensive (it's big data, after all...)

score 2 · Answer 1 · answered Apr 26 '21 at 04:45


A dataframe and a table are two different things in Spark.

A dataframe is an immutable distributed collection of data.

A table has metadata that points to the physical location from which its data is read.

When you convert a Spark dataframe to a table, you physically write the data to disk, which could be HDFS, S3, an Azure storage container, etc. Once the data is saved as a table, you can read it from anywhere, for example from a different Spark job or from any other workflow.

A dataframe, by contrast, is only valid within the Spark session in which you created it. Once you close that session, you can no longer read the dataframe or access its values. A dataframe has no specific memory location or physical path where it is saved; it is just a representation of the data that you read from some location.

answered Apr 26 '21 at 04:45

Nikunj Kakadiya


Thank you Nikunj. And when raw data is stored on a data lake: if you create a table in Databricks using that data lake location as source, does it make a replica of the data lake source or does it point to that same data? That's what I'm really trying to figure out... – beyondtdr Apr 26 '21 at 07:23
So that if you make changes in the table/view, the underlying (data lake) storage gets modified as well? Or is the table/view a copy of the data? – beyondtdr Apr 26 '21 at 07:27

When you create a table on data that is stored in a data lake, it just creates a table whose location points to the data lake location, plus other metadata. It does not duplicate or replicate the data. When you make changes, the files in the data lake themselves get modified, as there is no copy of the data anywhere else. If you want to check, you can try dropping the table: in that case only the metadata is deleted, and the physical files are still present in the data lake. – Nikunj Kakadiya Apr 26 '21 at 09:21
Oh that's great then :-) Thanks Nikunj, I'm going to test it – beyondtdr Apr 27 '21 at 09:33
