Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID guarantees to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, and a unified batch and streaming source and sink, and it is fully compatible with the Apache Spark APIs. (A short usage sketch follows the feature list below.)

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
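As a rough illustration of the features above, here is a minimal PySpark sketch of an ACID write, a time-travel read, and a merge. The path and column names are made up, and it assumes the delta-spark package (or the matching io.delta:delta-core / delta-spark artifact) is available to the session.

    # Minimal Delta Lake sketch in PySpark; path and column names are illustrative.
    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (
        SparkSession.builder
        .appName("delta-quickstart")
        # Delta needs its SQL extension and catalog registered on the session.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "/tmp/delta/events"

    # ACID write: create or replace the table.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").mode("overwrite").save(path)

    # Time travel: read the table as of an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Upsert (MERGE) through the DeltaTable API.
    updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "value"])
    target = DeltaTable.forPath(spark, path)
    (target.alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())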
1226 questions
0
votes
1 answer

Issue while running delta tables

I am trying to run the delta tables (screenshot in original post) and I am getting an error: java.util.concurrent.ExecutionException: com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException:…
0
votes
1 answer

Unable to write a DataFrame in delta format to HDFS

I am using Scala to write a DataFrame in Delta format to HDFS, but I am getting an error and cannot work out what is causing it. Please help me with this. Below is the code I am using to write a Delta table to my local HDFS. val…
Yash Tandon
  • 345
  • 5
  • 18
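For comparison, a hedged PySpark sketch of the same kind of write (the question itself uses Scala); the HDFS URI is a placeholder. In standalone Spark setups, a common cause of failures like this is that the Delta jars or the session configs below are missing, so they are shown explicitly.

    # Hypothetical sketch: writing a DataFrame as a Delta table to HDFS.
    # The HDFS URI is a placeholder; the original question used Scala.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("delta-hdfs-write")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").mode("overwrite").save("hdfs:///tmp/delta/demo")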
0
votes
1 answer

'DeltaTable' object has no attribute 'optimize'

I am trying to compact a Delta table in Databricks, but for some reason it fails to execute the optimize command on the Delta table and shows the error below. Can someone please tell me what the issue is here?
subro
  • 1,167
  • 4
  • 20
  • 32
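A hedged sketch of the two usual ways to compact a Delta table; the table name is hypothetical and an existing spark session (for example a Databricks notebook) is assumed. The Python DeltaTable.optimize() method was only added in newer Delta Lake / Databricks Runtime releases, so on older runtimes the attribute is missing and the SQL form is the fallback.

    # Hypothetical table name; assumes an existing `spark` session.
    from delta.tables import DeltaTable

    # SQL form (works even where the Python optimize() attribute is missing):
    spark.sql("OPTIMIZE my_db.my_table")

    # Python API form on newer Delta Lake releases:
    dt = DeltaTable.forName(spark, "my_db.my_table")
    dt.optimize().executeCompaction()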
0
votes
0 answers

Delta Lake connector: query change data feed entries of the table

Starting from version 408, Trino adds support for creating tables with the Trino change_data_feed_enabled table property. I am using Trino version 413. I already have some Delta tables and data in AWS S3, built using PySpark, with…
Jonathan Lam
  • 1,761
  • 2
  • 8
  • 17
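On the Spark side, a hedged sketch of enabling and reading the change data feed for a table like the one described; the table name and starting version are placeholders, and Trino connector configuration itself is not covered here.

    # Enable the change data feed on an existing Delta table, then read it.
    # Table name and starting version are placeholders.
    spark.sql("""
        ALTER TABLE my_db.my_table
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)

    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 0)
        .table("my_db.my_table")
    )
    changes.show()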
0
votes
1 answer

How to change the datatype of a column in Spark SQL

I want to change the datatype of a column from bigint to double in Spark for a Delta table. AnalysisException: Cannot update spark_catalog.default.tablename field column_name: bigint cannot be cast to decimal(10,2); line 1 pos 0; AlterColumn…
younus
  • 412
  • 2
  • 10
  • 20
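Delta generally does not allow this kind of type change in place (as the AnalysisException above shows), so the common workaround is to rewrite the table with the new schema. A hedged sketch, with hypothetical table and column names; note that this rewrites all of the data.

    # Rewrite the table with the new column type.
    # Table and column names are hypothetical.
    from pyspark.sql import functions as F

    df = spark.table("default.tablename")
    df = df.withColumn("column_name", F.col("column_name").cast("decimal(10,2)"))

    (df.write.format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .saveAsTable("default.tablename"))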
0
votes
0 answers

Reading an empty string from the source and then writing it as "Null" in a delta table

In my case, the source field has a blank/"" empty string. In the Bronze layer, when I am reading it, I have used .option("nullValue", "null"), and I am using Auto Loader from raw to bronze. The value is written in the bronze Delta table as expected, which is…
sayan nandi
  • 83
  • 1
  • 6
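One hedged approach, assuming the goal is real NULLs rather than the literal string "null": convert blank strings explicitly before writing to the bronze table. The DataFrame name and column name below are placeholders.

    # Turn empty/blank strings into real NULLs before writing to bronze.
    # `raw_df` and `source_field` are placeholders.
    from pyspark.sql import functions as F

    cleaned = raw_df.withColumn(
        "source_field",
        F.when(F.trim(F.col("source_field")) == "", None)
         .otherwise(F.col("source_field")),
    )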
0
votes
0 answers

How costly is it to change the datatype of a column in Delta Lake?

I have a big data pipeline in Spark that writes output in parquet to delta lake (backed by storage accounts on Azure). The output schemas keep changing as I'm still figuring out what I need them to be, and sometimes this results in a column needing…
ROODAY
  • 756
  • 1
  • 6
  • 23
0
votes
0 answers

Spark SQL: update one column in a delta table in the silver layer

I have a lookup table which looks like the attached screenshot below, and the Delta table is as below. As you can see, materialnum for all rows in the silver table is set to null, which I am trying to update from the lookup table based on SERIALNUM. In…
sayan nandi
  • 83
  • 1
  • 6
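A hedged sketch of the update-from-lookup pattern using the DeltaTable merge API; table and column names are hypothetical.

    # Fill materialnum in the silver table from a lookup table joined on SERIALNUM.
    # Table names are hypothetical.
    from delta.tables import DeltaTable

    silver = DeltaTable.forName(spark, "silver.my_table")
    lookup = spark.table("silver.material_lookup")

    (silver.alias("s")
        .merge(lookup.alias("l"), "s.SERIALNUM = l.SERIALNUM")
        .whenMatchedUpdate(set={"materialnum": "l.materialnum"})
        .execute())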
0
votes
1 answer

Delta Lake change log?

I have a Databricks environment and I need to create a real-time log table that contains all instances where any Delta table in my Hive metastore changes: CREATE, ALTER, INSERT, DELETE, any change to the table. I need this to serve as a trigger to…
Haze
  • 1
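Open-source Delta does not, as far as I know, expose a single cross-table event stream for this, but each table's transaction log is queryable, so one hedged starting point is polling DESCRIBE HISTORY per table (the table list below is hypothetical).

    # Poll the transaction history of each table of interest.
    # The table list is hypothetical.
    tables = ["db.table_a", "db.table_b"]
    for t in tables:
        history = spark.sql(f"DESCRIBE HISTORY {t}")
        history.select("version", "timestamp", "operation", "operationParameters").show(5)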
0
votes
0 answers

how lazy is DeltaTable.toDF (Spark and delta.io)?

Suppose you do something like import io.delta.tables._ val deltaTable = DeltaTable.forPath(spark, "...") deltaTable.updateExpr("column_name = value", Map("updated_column" -> "'new_value'")) val df = deltaTable.toDF Will df re-read the…
wrschneider
  • 17,913
  • 16
  • 96
  • 176
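One hedged way to answer this empirically (shown in PySpark rather than the Scala of the question; path and columns are made up) is to grab the DataFrame before an update and inspect it afterwards.

    # Check whether a DataFrame obtained from toDF() before an update
    # reflects the update afterwards. Path and columns are made up.
    from delta.tables import DeltaTable

    path = "/tmp/delta/lazy_check"
    spark.createDataFrame([(1, "old")], ["id", "val"]) \
        .write.format("delta").mode("overwrite").save(path)

    dt = DeltaTable.forPath(spark, path)
    df = dt.toDF()                                   # obtained before the update

    dt.update(condition="id = 1", set={"val": "'new'"})

    df.show()   # inspect whether this shows 'old' or 'new'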
0
votes
1 answer

MERGE in pyspark sql

I have a requirement where, in my ADLS Gen2 silver table, I have to update one of the columns based on a condition when the case is matched; if not, it should be a default value. The code I am using is as below: spark.sql(f"""MERGE INTO…
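A hedged sketch of the MERGE shape being described; the table, view, and column names are hypothetical, and updates is assumed to be a registered temp view. Because WHEN NOT MATCHED only fires for source rows without a match, the "matched but condition is false, use a default" case often ends up as a CASE expression inside the UPDATE.

    # Hypothetical MERGE: update a column conditionally, otherwise use a default.
    # `silver.target` and `updates` (a temp view) are placeholders.
    spark.sql("""
        MERGE INTO silver.target AS t
        USING updates AS u
          ON t.id = u.id
        WHEN MATCHED THEN UPDATE SET
          t.status = CASE WHEN u.flag = 'Y' THEN u.status ELSE 'DEFAULT' END
        WHEN NOT MATCHED THEN INSERT *
    """)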
0
votes
0 answers

Issue in implementing interpol and resample with Databricks Tempo

I am using Databricks Delta with Tempo to aggregate time-series data. The system has data for each second, and I need to aggregate it and show the data at the hour level for each day, as well as at the day level. I have a dataset in the following format where data is…
rbpjava
  • 3
  • 3
0
votes
1 answer

Create a delta table from a CSV file in Synapse using PySpark with a user-defined schema; columns should be able to handle up to 30000 characters in length

I want to create a table from a CSV file with standard column data types like datetime, varchar, int, etc., where columns can accommodate up to 30000 characters in length and the table can also handle CLOB columns. I have a CSV file which I am converting into…
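A hedged sketch of the read-with-schema-then-write-Delta step; paths and column names are made up. Spark's StringType (and the Parquet underneath Delta) has no declared length, so 30000-character values are mostly a concern for whatever reads the table afterwards, for example a Synapse SQL definition using varchar(max).

    # Read a CSV with an explicit schema and write it out as a Delta table.
    # Paths and column names are made up.
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   IntegerType, TimestampType)

    schema = StructType([
        StructField("event_time", TimestampType(), True),
        StructField("event_id", IntegerType(), True),
        StructField("payload", StringType(), True),   # long text / CLOB-like column
    ])

    df = (spark.read
          .option("header", "true")
          .option("multiLine", "true")    # long text fields may contain newlines
          .schema(schema)
          .csv("abfss://container@account.dfs.core.windows.net/raw/input.csv"))

    df.write.format("delta").mode("overwrite").save(
        "abfss://container@account.dfs.core.windows.net/delta/my_table")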
0
votes
1 answer

Connecting and Authenticating to Delta Lake on Azure Data Lake Storage Gen 2 using delta-rs Python API

I am trying to connect and authenticate to an existing Delta Table in Azure Data Lake Storage Gen 2 using the Delta-rs Python API. I found the Delta-rs library from this StackOverflow question: Delta Lake independent of Apache Spark? However, the…
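A hedged sketch of what the delta-rs side typically looks like; the URI and, in particular, the storage_options key names are assumptions that should be checked against the deltalake / object_store documentation for the installed version.

    # Open an existing Delta table in ADLS Gen2 with delta-rs (the `deltalake` package).
    # The storage_options key names below are assumptions; verify them for your version.
    from deltalake import DeltaTable

    table_uri = "abfss://container@account.dfs.core.windows.net/path/to/table"
    storage_options = {
        "azure_storage_account_name": "account",
        "azure_storage_account_key": "<key>",   # or SAS token / service principal settings
    }

    dt = DeltaTable(table_uri, storage_options=storage_options)
    print(dt.version())
    df = dt.to_pandas()        # or dt.to_pyarrow_table()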
0
votes
1 answer

Is there a command to convert existing parquet data to Iceberg table in place?

Delta Lake has the capability of transforming existing Parquet data into a Delta table by "simply" adding its own metadata, the _delta_log directory. https://docs.delta.io/2.2.0/delta-utility.html#convert-a-parquet-table-to-a-delta-table -- Convert…
YFl
  • 845
  • 7
  • 22
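For reference, the Delta-side conversion the question points at looks roughly like the sketch below (the path and partition column are hypothetical); on the Iceberg side, the Spark procedures such as snapshot, migrate, and add_files are the usual place to look, subject to the documentation for the Iceberg version in use.

    # Register existing Parquet files under a new _delta_log without rewriting them.
    # Path and partition column are hypothetical.
    spark.sql("CONVERT TO DELTA parquet.`/data/events` PARTITIONED BY (dt STRING)")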