I'm having trouble understanding the concept of Delta Lake. Example:
First, I read a Parquet file:
taxi_df = (spark.read.format("parquet").option("header", "true").load("dbfs:/mnt/randomcontainer/taxirides.parquet"))
Then I save it as a managed table using saveAsTable:
taxi_df.write.format("delta").mode("overwrite").saveAsTable("taxi_managed_table")
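For reference, this is how I would check what saveAsTable actually wrote (a minimal sketch, assuming DESCRIBE DETAIL is the right command and that the table landed in the default database):

# Show the storage format and location of the managed table;
# for a Delta table the "format" column should read "delta".
spark.sql("DESCRIBE DETAIL taxi_managed_table").select("format", "location").show(truncate=False)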
Next, I read back the managed table I just saved:
taxi_read_from_managed_table = (spark.read.format("delta").option("header", "true").load("dbfs:/user/hive/warehouse/taxi_managed_table/"))
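Side note: I assume I could also read the managed table back by name instead of by its warehouse path (a sketch, assuming the table is registered in the default database):

# Read the managed table via its metastore name rather than the file path
taxi_by_name = spark.read.table("taxi_managed_table")

Either way, the result is a DataFrame.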
When I check the type of taxi_read_from_managed_table, it shows pyspark.sql.dataframe.DataFrame, not DeltaTable:
type(taxi_read_from_managed_table)
# returns pyspark.sql.dataframe.DataFrame

Only after I convert it explicitly with the following command do I get a DeltaTable:
from delta.tables import DeltaTable

taxi_delta_table = DeltaTable.convertToDelta(spark, "parquet.`dbfs:/user/hive/warehouse/taxismallmanagedtable/`")
type(taxi_delta_table)
# returns delta.tables.DeltaTable
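For completeness, my current understanding (which may well be the part I am getting wrong) is that a DeltaTable handle can also be obtained without converting anything, and that there is a way to test whether a path already holds a Delta table; a sketch, assuming DeltaTable.forPath and DeltaTable.isDeltaTable are the right calls:

# Get a DeltaTable handle for an existing Delta location (no conversion involved)
dt = DeltaTable.forPath(spark, "dbfs:/user/hive/warehouse/taxi_managed_table/")

# Check whether the path already contains a Delta table (i.e. has a _delta_log)
print(DeltaTable.isDeltaTable(spark, "dbfs:/user/hive/warehouse/taxi_managed_table/"))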
Does that mean that the table saved with saveAsTable above is not a Delta table and won't get the automatic optimizations that Delta Lake provides?
How do you establish whether something is a Delta table (i.e. managed by Delta Lake) or not?
I understand that Delta Live Tables only works with delta.tables.DeltaTable objects. Is that correct?