Questions tagged [delta-lake]

Delta Lake is an open source project that supports ACID transactions on top of Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the PySpark sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
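
As a quick illustration of the Updates and Deletes and Time Travel features above, here is a minimal PySpark sketch (not part of the delta.io description; the Python DeltaTable API also exists alongside the Scala/Java one and is used in several questions below). The path, columns, and version number are placeholders:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Update and delete rows in an existing Delta table (hypothetical path).
    events = DeltaTable.forPath(spark, "/mnt/delta/events")
    events.update(condition="status = 'pending'", set={"status": "'expired'"})
    events.delete("event_date < '2019-01-01'")

    # Time travel: read an earlier snapshot of the same table.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")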
1226 questions
1 vote, 2 answers

Register Delta table in Hive metastore error

I need to register a Delta table in the Hive metastore to be able to query it using an external reporting tool connecting to the Thrift Server. The PySpark API works well; I am able to create a DeltaTable object: ordersDeltaTable = DeltaTable.forPath(spark,…
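
For context, a common way to make a path-based Delta table visible to the Hive metastore (and thus to a Thrift server) is to register it as an external table over the existing location; a minimal sketch with a hypothetical table name and path (on open-source Spark this also assumes the Delta SQL extensions are configured on the session):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Register the existing Delta directory as an unmanaged table in the metastore.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS orders
        USING DELTA
        LOCATION '/mnt/delta/orders'
    """)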
1 vote, 1 answer

Efficient execution on PySpark/Delta dataframes

Using PySpark / Delta Lake on Databricks, I have the following scenario: sdf = spark.read.format("delta").table("...") result = sdf.filter(...).groupBy(...).agg(...) analysis_1 = result.groupBy(...).count() # transformation performed here analysis_2…
casparjespersen • 3,460 • 5 • 38 • 63
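
When one filtered/aggregated DataFrame feeds several downstream analyses, persisting it usually avoids recomputing the shared work for each branch; a minimal sketch with hypothetical table and column names:

    from pyspark.sql import functions as F

    sdf = spark.read.format("delta").table("events")            # hypothetical table
    result = (sdf.filter("status = 'done'")
                 .groupBy("client_id")
                 .agg(F.sum("amount").alias("total")))
    result.cache()                                              # shared intermediate

    analysis_1 = result.filter("total > 100").count()           # computes and caches once
    analysis_2 = result.agg(F.avg("total")).collect()           # reuses the cached result

    result.unpersist()                                          # release when finished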
1 vote, 1 answer

What S3 bucket does DBFS use? How can I get the S3 location of a DBFS path?

I am trying to migrate my Hive metadata to Glue. While migrating the Delta table, when I provide the same DBFS path, I am getting an error - "Cannot create table: The associated location is not empty." When I try to create the same delta…
kushagra • 131 • 3 • 10
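
For reference, on Databricks the mapping between DBFS paths and cloud storage can be inspected from a notebook with the mount listing (dbutils is the Databricks-provided utility; the DBFS root itself is backed by an S3 bucket configured when the workspace is set up):

    # Print each DBFS mount point and the cloud location it resolves to,
    # e.g. /mnt/raw -> s3a://<bucket>/<prefix>.
    for m in dbutils.fs.mounts():
        print(m.mountPoint, "->", m.source)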
1 vote, 1 answer

Streaming aggregate not writing into sink

I have to process some files which arrive daily. The information has a primary key (date, client_id, operation_id). So I created a stream which appends only new data into a Delta table: operations\ .repartition('date')\ …
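
For reference, the usual shape of an append-only streaming write into a Delta table looks like the sketch below (paths are placeholders); if the query also aggregates, append output mode additionally needs a watermark before any rows are emitted, which is a common reason a streaming aggregate appears to write nothing:

    # Append-only Delta sink with a checkpoint location (placeholder paths).
    query = (operations.writeStream
             .format("delta")
             .outputMode("append")
             .partitionBy("date")
             .option("checkpointLocation", "/mnt/delta/_checkpoints/operations")
             .start("/mnt/delta/operations"))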
1 vote, 3 answers

How to resolve Spark java.lang.OutOfMemoryError: Java heap space while writing out in delta format?

I am loading around 4 GB of data from Parquet files into a Spark DF. Loading takes a few hundred milliseconds. Then I register the DF as a table to execute SQL queries. sparkDF =…
ShwetaPri • 173 • 1 • 4 • 11
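
As general context for this kind of failure, the usual mitigations are to avoid pulling data back to the driver and to break the write into more, smaller tasks; driver and executor heap are normally raised through spark-submit or cluster settings rather than inside the job. A minimal sketch with placeholder paths:

    # Repartition before writing so individual output tasks stay small, and let
    # the data flow executor-to-storage instead of collecting to the driver.
    sparkDF = spark.read.parquet("/data/input")
    (sparkDF
     .repartition(200)
     .write
     .format("delta")
     .mode("overwrite")
     .save("/data/delta/output"))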
1 vote, 2 answers

How to Convert Parquet to Spark Delta Lake?

I was trying to convert a set of parquet files into delta format in-place. I tried using the CONVERT command as mentioned in the Databricks documentation.…
ShwetaPri • 173 • 1 • 4 • 11
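
For reference, the conversion can be run either through the Python API or SQL; a minimal sketch (the path and partition column are placeholders, and partitioned Parquet data needs its partition schema spelled out):

    from delta.tables import DeltaTable

    # In-place conversion of an existing Parquet directory to Delta.
    DeltaTable.convertToDelta(spark, "parquet.`/data/events`")

    # Partitioned variant:
    # DeltaTable.convertToDelta(spark, "parquet.`/data/events`", "date DATE")

    # SQL equivalent:
    # spark.sql("CONVERT TO DELTA parquet.`/data/events`")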
1 vote, 3 answers

Is it possible to connect to Databricks Delta Lake tables from ADF?

I'm looking for a way to connect to Databricks Delta Lake tables from ADF and other Azure services (like Data Catalog). I don't see a Databricks data store listed in ADF data sources. On a similar question - Is it possible to read an Azure…
Mauryas • 41 • 2 • 8
1 vote, 3 answers

How to write / writeStream each row of a dataframe into a different delta table

Each row of my dataframe has CSV content. I am struggling to save each row in a different, specific table. I believe I need to use a foreach or UDF to accomplish this, but this is simply not working. All the content I managed to find…
Flavio Pegas • 388 • 1 • 9 • 26
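
One pattern that is often suggested for this (rather than a UDF or foreach on the executors) is to derive the distinct targets on the driver and write each filtered slice separately; a minimal sketch with hypothetical column and path names:

    # Write each group of rows to its own Delta table, one write per target.
    targets = [r["target_table"] for r in df.select("target_table").distinct().collect()]

    for t in targets:
        (df.filter(df.target_table == t)
           .write
           .format("delta")
           .mode("append")
           .save(f"/mnt/delta/{t}"))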
1 vote, 1 answer

What is the difference between querying tables using Delta format with Pyspark-SQL versus Pyspark?

I am querying tables but I get different results using two approaches, and I would like to understand the reason. I created a table using a Delta location. I want to query the data that I stored in that location. I'm using Amazon S3. I created the table…
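
For reference, the two access paths being compared usually look like this (the database, table name, and S3 path are placeholders); differences between them typically come down to how the metastore table was defined over the location:

    # Through the metastore table registered over the Delta location ...
    via_sql = spark.sql("SELECT COUNT(*) FROM my_db.my_table")

    # ... versus reading the S3 path directly as Delta.
    via_path = (spark.read.format("delta")
                .load("s3://my-bucket/path/to/table")
                .selectExpr("COUNT(*)"))

    via_sql.show()
    via_path.show()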
1 vote, 0 answers

Is this the best method to load and merge data into an existing Delta Table on Databricks?

I'm new to using Databricks and I'm trying to test the validity of continuously loading an hourly file into a primary table that will be used for reporting. Each hourly file is roughly 300-400 GB and contains ~1-1.3 billion records. I would like to have the primary…
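
For context, the usual building block for this kind of hourly load is a MERGE (upsert) of the new batch into the primary Delta table on its key; a minimal sketch with hypothetical paths and key column:

    from delta.tables import DeltaTable

    hourly = spark.read.parquet("/landing/hourly/")           # new hourly batch
    primary = DeltaTable.forPath(spark, "/mnt/delta/primary")

    (primary.alias("t")
     .merge(hourly.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())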
1 vote, 4 answers

What is the best way to clean up and recreate a Databricks Delta table?

I am trying to clean up and recreate a Databricks Delta table for integration tests. I want to run the tests on a DevOps agent, so I am using JDBC (Simba driver), but it says statement type "DELETE" is not supported. When I clean up the underlying DBFS…
Preeti Joshi • 841 • 1 • 13 • 20
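
For context, a full reset is usually done from a Spark session (notebook or job) rather than over JDBC, since that path does not accept every statement type; a minimal sketch with placeholder names (dbutils is the Databricks utility, source_df is whatever seeds the test data):

    # Drop the metastore entry, clear the underlying files, then recreate.
    spark.sql("DROP TABLE IF EXISTS test_db.events")
    dbutils.fs.rm("/mnt/delta/test/events", True)

    (source_df.write
     .format("delta")
     .mode("overwrite")
     .option("overwriteSchema", "true")
     .option("path", "/mnt/delta/test/events")
     .saveAsTable("test_db.events"))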
1 vote, 3 answers

Databricks Delta Update

How can we update multiple records in a table from another table using Databricks Delta? I want to achieve something like: update ExistingTable set IsQualified = updates.IsQualified from updates where ExistingTable.key = updates.key. It's failing…
Psingla • 21 • 1 • 5
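
For reference, Delta's UPDATE statement does not take a FROM clause; the multi-row update-from-another-table shape in the question is normally written as a MERGE. A sketch using the table and column names from the question, assuming both are registered tables:

    spark.sql("""
        MERGE INTO ExistingTable AS t
        USING updates AS u
        ON t.key = u.key
        WHEN MATCHED THEN UPDATE SET t.IsQualified = u.IsQualified
    """)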
0 votes, 0 answers

How to build delta-rs Python for HDFS support

I would like to build delta-rs v0.10.1 with HDFS support. For this purpose I have customized the Cargo.toml file and added the hdfs feature, then built and installed it in my environment using the make install command. features = ["azure", "gcs", "python", "datafusion",…
0 votes, 1 answer

How to extract value from a spark dataframe and add it to a second one as a column?

I have 2 large Spark dataframes, df1 and df2. df1 has a column named colName that has only one distinct value. I need to add this column to df2. I'm wondering what would be the most efficient way to do that? My idea is to use limit() or…
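
For context, when the column really has a single distinct value, the cheap route is to pull that one value to the driver and attach it to the other DataFrame as a literal, avoiding a join; a minimal sketch using the names from the question:

    from pyspark.sql import functions as F

    # Take the single distinct value of colName from df1 ...
    value = df1.select("colName").first()[0]

    # ... and add it to df2 as a constant column.
    df2_with_col = df2.withColumn("colName", F.lit(value))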
0 votes, 0 answers

How to partition a table with different ingestion sources and volumes?

I have a database implemented with Databricks Delta Lake and Azure Storage. I need to create a table that would have new data ingested on a daily basis. The data would come from different sources. Each day the volume can differ (it can be minor or…
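
For context, a common layout for this kind of table is to partition by ingestion date (and, if the source list is small and stable, by source as well), so each daily load lands in its own directory regardless of how large it is; a minimal sketch with placeholder names:

    # Append each daily batch into a Delta table partitioned by ingestion date
    # and source; column names and path are placeholders.
    (daily_df.write
     .format("delta")
     .mode("append")
     .partitionBy("ingest_date", "source")
     .save("/mnt/delta/ingest"))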