Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
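
To make a few of the listed features concrete, here is a minimal PySpark sketch, assuming the delta-spark pip package is installed and using a hypothetical local path /tmp/delta/events; it shows an ACID batch write, a time-travel read, and the audit history.

    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable

    builder = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # ACID write: the new snapshot becomes visible only when the commit succeeds.
    spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Time travel: read an earlier snapshot by version number.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Audit history: every change is recorded in the transaction log.
    DeltaTable.forPath(spark, "/tmp/delta/events").history().show()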
1226 questions
0
votes
0 answers

Delta package installation locally with pyspark: ivy-cache file not found

I am trying to use the Delta format, following the instructions on the Delta Lake site. On cmd: pyspark --packages io.delta:delta-core_2.12:0.8.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf…
Avinash
  • 359
  • 3
  • 5
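
For a local setup like the one in this question, a sketch of doing the same thing from a Python script instead of the pyspark shell is shown below; the Delta version is only an example and has to match your Spark version, and spark.jars.ivy is an optional way to point the ivy cache at a writable directory.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("delta-local")
        # Spark resolves the Delta jars itself; downloads land in the ivy cache.
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
        # Optional: relocate the ivy cache if the default location is not writable.
        .config("spark.jars.ivy", "/tmp/.ivy2")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    spark.range(3).write.format("delta").mode("overwrite").save("/tmp/delta/smoke_test")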
0
votes
1 answer

Is there an elegant and authoritative way to identify when a directory is in fact a delta table?

I have created the code below to identify whether a directory is a delta table/file/directory. It's kind of brute force, but it appears to work for the most part. I am wondering if there is a more elegant way to determine this. I am in a databricks…
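
One less brute-force option, sketched below under the assumption that delta-spark is available and spark is an existing SparkSession, is the DeltaTable.isDeltaTable helper, which checks for a valid _delta_log at the path; the path shown is hypothetical.

    from delta.tables import DeltaTable

    def is_delta_table(spark, path: str) -> bool:
        # True only if the directory holds a readable Delta transaction log.
        return DeltaTable.isDeltaTable(spark, path)

    print(is_delta_table(spark, "/mnt/data/some_directory"))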
0
votes
0 answers

Why Delta Lake Column Mapping needs both physical name and id

In the documentation for the Delta Transaction Protocol (https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-mapping), the column mapping section states the following: "There are two modes of column mapping, by name and by id. In both modes,…
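
For context on where those fields come into play, here is a hedged sketch of enabling column mapping on an existing table (the table and column names are hypothetical); once the mode is set, a rename only rewrites metadata because readers resolve columns through the mapping rather than the Parquet field names.

    # Column mapping requires reader version 2 and writer version 5.
    spark.sql("""
        ALTER TABLE my_db.my_table SET TBLPROPERTIES (
            'delta.columnMapping.mode' = 'name',
            'delta.minReaderVersion' = '2',
            'delta.minWriterVersion' = '5'
        )
    """)

    # With mapping enabled, a rename touches the metadata, not the data files.
    spark.sql("ALTER TABLE my_db.my_table RENAME COLUMN old_name TO new_name")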
0
votes
1 answer

Databricks - Cannot create table: The associated location is not empty and also not a Delta table

I am getting the error: Cannot create table ('hive_metastore.MY_SCHEMA.MY_TABLE'). The associated location ('dbfs:/user/hive/warehouse/my_schema.db/my_table') is not empty and also not a Delta table. I tried to overcome this by running drop table…
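
A hedged sketch of one common way out, assuming the leftover files at the reported location are safe to discard (inspect them first, the removal is destructive): drop the metastore entry, clear the directory, then recreate the table.

    spark.sql("DROP TABLE IF EXISTS hive_metastore.MY_SCHEMA.MY_TABLE")

    # dbutils is only available on Databricks; True enables recursive delete.
    dbutils.fs.rm("dbfs:/user/hive/warehouse/my_schema.db/my_table", True)

    spark.sql("""
        CREATE TABLE hive_metastore.MY_SCHEMA.MY_TABLE (id BIGINT, name STRING)
        USING DELTA
    """)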
0
votes
0 answers

ALTER TABLE ADD COLUMN AFTER does not add my column after the specified column in a Delta table

I want to add a column to my delta table via Databricks SQL. Following the syntax on the Microsoft site https://learn.microsoft.com/en-us/azure/databricks/delta/update-schema and Databricks site…
CClvu
  • 1
  • 3
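
For reference, this is the ADD COLUMNS form documented for Delta tables, sketched with hypothetical table and column names; it is worth confirming that the target is actually a Delta table and that the runtime in use supports column positioning.

    spark.sql("""
        ALTER TABLE my_catalog.my_schema.my_table
        ADD COLUMNS (new_col STRING AFTER existing_col)
    """)

    # Repositioning a column that already exists is a separate statement.
    spark.sql("""
        ALTER TABLE my_catalog.my_schema.my_table
        ALTER COLUMN new_col AFTER some_other_col
    """)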
0
votes
1 answer

Databricks/Spark: What is the difference between file pruning and file skipping?

File skipping in delta files is when you skip reading a file altogether because you know that the value you are looking for cannot exist in that file. This is determined by looking at the column stats. Reading about file pruning - it…
Ashwin
  • 12,691
  • 31
  • 118
  • 190
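
As a small illustration of the skipping mechanism the question describes, here is a sketch with a hypothetical table path; Delta stores per-file min/max statistics in the transaction log, so a selective filter lets whole files be dropped from the scan.

    df = spark.read.format("delta").load("/mnt/delta/events")

    # Only files whose min/max range for event_date can contain this value are read;
    # the rest are skipped based on the file-level stats in the log.
    df.filter("event_date = '2023-01-01'").explain()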
0
votes
0 answers

Delta Optimize Compaction For Incremental Update

I am working on setting up a data pipeline that produces a huge number of small files in delta tables partitioned by certain dimensions. To boost read performance for the consumers, I am looking to add compaction. Looking at the…
prlucknow
  • 13
  • 3
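
A hedged sketch of scoping compaction to the partitions written by the latest incremental load, so already-compacted history is not rewritten on every run; the path, partition column and predicate are hypothetical, and the Python optimize builder needs Delta Lake 2.0 or later.

    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "/mnt/delta/events")

    # Compact only the partition(s) touched by the latest incremental load.
    dt.optimize().where("ingest_date = '2023-01-01'").executeCompaction()

    # SQL equivalent:
    # OPTIMIZE delta.`/mnt/delta/events` WHERE ingest_date = '2023-01-01'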
0
votes
1 answer

Databricks is treating NULL as string

I am using Databricks Unity Catalog, and I have a requirement to upload a CSV file, process it, and load it into a final table. However, when uploading the file in Databricks, it converts NULL data to the string 'NULL', which is causing an issue. Do…
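
If the CSV is read in code rather than through the upload UI, a sketch like the one below (path and options are examples only) tells the reader which literal should become a real null instead of the string 'NULL'.

    df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("nullValue", "NULL")  # map the literal string NULL to a real null
        .load("/Volumes/my_catalog/my_schema/my_volume/input.csv")
    )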
0
votes
0 answers

what is the benefit of using delta lake or iceberg table format?

We currently store data on S3 in Parquet format and use the AWS Glue data catalog to store table metadata. We add partitions by date or hour. Most of the queries we have are read-only. I am wondering about the benefits we can get from…
yuyang
  • 1,511
  • 2
  • 15
  • 40
0
votes
1 answer

Delta lake merge with Idempotent writes for non-streaming source

How do I handle a crash (for any reason) during a delta table merge on the target delta table? Will it create duplicate records if I re-run the partially failed (crashed) merge command on the same target delta table, but the source delta table (some records…
Kunfu Panda
  • 57
  • 1
  • 2
  • 8
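
A sketch of a merge keyed on a stable business key, with hypothetical paths and columns: a Delta MERGE commits atomically, so a crashed run either produced a complete new table version or left no trace, and re-running the same merge matches on the key rather than inserting duplicates.

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/mnt/delta/target")
    source = spark.read.format("delta").load("/mnt/delta/source")

    (
        target.alias("t")
        .merge(source.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )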
0
votes
0 answers

Connect to a Delta table in HDFS using Python without PySpark

I have a delta table in HDFS stored as a Hive table. I need to connect to the table and load its latest version. I was able to connect to HDFS using the pyarrow library, but it loads all versions in HDFS. Here is my code: import…
Josin Mathew
  • 45
  • 2
  • 9
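
A sketch using the deltalake (delta-rs) Python package, which reads the transaction log directly and resolves the latest snapshot without Spark; the HDFS URI is hypothetical, and HDFS support depends on the storage backends available in your deltalake build.

    from deltalake import DeltaTable

    dt = DeltaTable("hdfs://namenode:8020/warehouse/my_table")  # latest version by default
    print(dt.version())

    # Only the parquet files belonging to the current snapshot are read.
    arrow_table = dt.to_pyarrow_table()
    pdf = dt.to_pandas()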
0
votes
1 answer

Read entire MongoDB document to a PySpark DataFrame as a single text column

I wish to read documents from a MongoDB database into a PySpark DataFrame in a truly schema-less way, as part of the bronze layer of a data lake architecture on Databricks. This is important since I want no schema inference or assumptions to be made…
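
One schema-less approach, sketched below with hypothetical connection details: serialize each document to a JSON string with pymongo and land the strings as a single-column bronze table. Note that this pulls data through the driver rather than the MongoDB Spark connector, so it only suits modest volumes.

    from bson import json_util
    from pymongo import MongoClient

    client = MongoClient("mongodb://host:27017")
    docs = client["my_db"]["my_collection"].find()

    # One row per document, the raw JSON kept verbatim in a single string column.
    rows = [(json_util.dumps(d),) for d in docs]
    bronze_df = spark.createDataFrame(rows, schema="raw_json STRING")
    bronze_df.write.format("delta").mode("append").save("/mnt/bronze/my_collection")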
0
votes
0 answers

Possible bug in using Pyarrow is_null function with delta tables

I've noticed an issue while trying to apply filters on Pyarrow datasets initialised from delta tables. Specifically, the is_null expression predicate only seems to return rows if all the rows in the particular partition/parquet file have null values…
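
A minimal sketch of the filter in question, with a hypothetical table path, which can help narrow down whether the behaviour comes from the deltalake dataset, the partition layout, or pyarrow itself.

    import pyarrow.dataset as ds
    from deltalake import DeltaTable

    dataset = DeltaTable("/data/my_delta_table").to_pyarrow_dataset()

    # Rows where some_col is null; compare this count against reading the full
    # table and counting nulls in pandas to see whether rows are being dropped.
    nulls = dataset.to_table(filter=ds.field("some_col").is_null())
    print(nulls.num_rows)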
0
votes
1 answer

Delta File Format

Delta file format: Nowadays, different file formats for data processing are becoming popular. One of them is the Delta format, developed and open sourced by Databricks. Its most important feature is ACID (others being support for…
user3103957
  • 636
  • 4
  • 16
0
votes
0 answers

Does deltalake-rs support streaming?

Does anyone know if it is possible to use deltalake-rs for streaming with DataFusion in Rust? I can’t seem to find anything on this. Spark does have it. See https://docs.delta.io/latest/delta-streaming.html#delta-table-as-a-sink I am looking at…
dade
  • 3,340
  • 4
  • 32
  • 53