Questions tagged [delta-lake]

Delta Lake is an open source project that supports ACID transactions on top of Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the PySpark sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
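
As a quick illustration of the Updates and Deletes and Time Travel features above, here is a minimal PySpark sketch (not part of the delta.io description; the Python DeltaTable API also exists alongside the Scala/Java one and is used in several questions below). The path, columns, and version number are placeholders:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Update and delete rows in an existing Delta table (hypothetical path).
    events = DeltaTable.forPath(spark, "/mnt/delta/events")
    events.update(condition="status = 'pending'", set={"status": "'expired'"})
    events.delete("event_date < '2019-01-01'")

    # Time travel: read an earlier snapshot of the same table.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")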
1226 questions
1 vote, 2 answers

Register Delta table in Hive metastore error

I need to register a Delta table in the Hive metastore to be able to query it using an external reporting tool connecting to the Thrift Server. The PySpark API works well; I am able to create a DeltaTable object: ordersDeltaTable = DeltaTable.forPath(spark,…
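
For context, a common way to make a path-based Delta table visible to the Hive metastore (and thus to a Thrift server) is to register it as an external table over the existing location; a minimal sketch with a hypothetical table name and path (on open-source Spark this also assumes the Delta SQL extensions are configured on the session):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Register the existing Delta directory as an unmanaged table in the metastore.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS orders
        USING DELTA
        LOCATION '/mnt/delta/orders'
    """)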
1 vote, 1 answer

Efficient execution on PySpark/Delta dataframes

Using PySpark / Delta Lake on Databricks, I have the following scenario: sdf = spark.read.format("delta").table("...") result = sdf.filter(...).groupBy(...).agg(...) analysis_1 = result.groupBy(...).count() # transformation performed here analysis_2…
casparjespersen • 3,460 • 5 • 38 • 63
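
When one filtered/aggregated DataFrame feeds several downstream analyses, persisting it usually avoids recomputing the shared work for each branch; a minimal sketch with hypothetical table and column names:

    from pyspark.sql import functions as F

    sdf = spark.read.format("delta").table("events")            # hypothetical table
    result = (sdf.filter("status = 'done'")
                 .groupBy("client_id")
                 .agg(F.sum("amount").alias("total")))
    result.cache()                                              # shared intermediate

    analysis_1 = result.filter("total > 100").count()           # computes and caches once
    analysis_2 = result.agg(F.avg("total")).collect()           # reuses the cached result

    result.unpersist()                                          # release when finished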
1 vote, 1 answer

What S3 bucket does DBFS use? How can I get the S3 location of a DBFS path?

I am trying to migrate my Hive metadata to Glue. While migrating the Delta table, when I provide the same DBFS path, I am getting an error - "Cannot create table: The associated location is not empty." When I try to create the same delta…
kushagra • 131 • 3 • 10
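
For reference, on Databricks the mapping between DBFS paths and cloud storage can be inspected from a notebook with the mount listing (dbutils is the Databricks-provided utility; the DBFS root itself is backed by an S3 bucket configured when the workspace is set up):

    # Print each DBFS mount point and the cloud location it resolves to,
    # e.g. /mnt/raw -> s3a://<bucket>/<prefix>.
    for m in dbutils.fs.mounts():
        print(m.mountPoint, "->", m.source)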
1 vote, 1 answer

Streaming aggregate not writing into sink

I have to process some files which arrive daily. The information has a primary key (date, client_id, operation_id). So I created a stream which appends only new data into a Delta table: operations\ .repartition('date')\ …
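
For reference, the usual shape of an append-only streaming write into a Delta table looks like the sketch below (paths are placeholders); if the query also aggregates, append output mode additionally needs a watermark before any rows are emitted, which is a common reason a streaming aggregate appears to write nothing:

    # Append-only Delta sink with a checkpoint location (placeholder paths).
    query = (operations.writeStream
             .format("delta")
             .outputMode("append")
             .partitionBy("date")
             .option("checkpointLocation", "/mnt/delta/_checkpoints/operations")
             .start("/mnt/delta/operations"))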
1 vote, 3 answers

How to resolve Spark java.lang.OutOfMemoryError: Java heap space while writing out in delta format?

I am loading around 4 GB of data from Parquet files into a Spark DF. Loading takes a few hundred milliseconds. Then I register the DF as a table to execute SQL queries. sparkDF =…
ShwetaPri • 173 • 1 • 4 • 11
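
As general context for this kind of failure, the usual mitigations are to avoid pulling data back to the driver and to break the write into more, smaller tasks; driver and executor heap are normally raised through spark-submit or cluster settings rather than inside the job. A minimal sketch with placeholder paths:

    # Repartition before writing so individual output tasks stay small, and let
    # the data flow executor-to-storage instead of collecting to the driver.
    sparkDF = spark.read.parquet("/data/input")
    (sparkDF
     .repartition(200)
     .write
     .format("delta")
     .mode("overwrite")
     .save("/data/delta/output"))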
1 vote, 2 answers

How to Convert Parquet to Spark Delta Lake?

I was trying to convert a set of parquet files into delta format in-place. I tried using the CONVERT command as mentioned in the Databricks documentation.…
ShwetaPri • 173 • 1 • 4 • 11
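
For reference, the conversion can be run either through the Python API or SQL; a minimal sketch (the path and partition column are placeholders, and partitioned Parquet data needs its partition schema spelled out):

    from delta.tables import DeltaTable

    # In-place conversion of an existing Parquet directory to Delta.
    DeltaTable.convertToDelta(spark, "parquet.`/data/events`")

    # Partitioned variant:
    # DeltaTable.convertToDelta(spark, "parquet.`/data/events`", "date DATE")

    # SQL equivalent:
    # spark.sql("CONVERT TO DELTA parquet.`/data/events`")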
1 vote, 3 answers

Is it possible to connect to Databricks Delta Lake tables from ADF?

I'm looking for a way to connect to Databricks Delta Lake tables from ADF and other Azure services (like Data Catalog). I don't see a Databricks data store listed in ADF data sources. On a similar question - Is it possible to read an Azure…
Mauryas • 41 • 2 • 8
1 vote, 3 answers

How to write / writeStream each row of a dataframe into a different delta table

Each row of my dataframe has CSV content. I am struggling to save each row in a different, specific table. I believe I need to use a foreach or UDF to accomplish this, but this is simply not working. All the content I managed to find…
Flavio Pegas • 388 • 1 • 9 • 26
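
One pattern that is often suggested for this (rather than a UDF or foreach on the executors) is to derive the distinct targets on the driver and write each filtered slice separately; a minimal sketch with hypothetical column and path names:

    # Write each group of rows to its own Delta table, one write per target.
    targets = [r["target_table"] for r in df.select("target_table").distinct().collect()]

    for t in targets:
        (df.filter(df.target_table == t)
           .write
           .format("delta")
           .mode("append")
           .save(f"/mnt/delta/{t}"))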
1 vote, 1 answer

What is the difference between querying tables using Delta format with Pyspark-SQL versus Pyspark?

I am querying tables but I get different results using two approaches, and I would like to understand the reason. I created a table using a Delta location. I want to query the data that I stored in that location. I'm using Amazon S3. I created the table…
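
For reference, the two access paths being compared usually look like this (the database, table name, and S3 path are placeholders); differences between them typically come down to how the metastore table was defined over the location:

    # Through the metastore table registered over the Delta location ...
    via_sql = spark.sql("SELECT COUNT(*) FROM my_db.my_table")

    # ... versus reading the S3 path directly as Delta.
    via_path = (spark.read.format("delta")
                .load("s3://my-bucket/path/to/table")
                .selectExpr("COUNT(*)"))

    via_sql.show()
    via_path.show()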
1 vote, 0 answers

Is this the best method to load and merge data into an existing Delta Table on Databricks?

I'm new to using Databricks and I'm trying to test the validity of continuously loading an hourly file into a primary table that will be used for reporting. Each hourly file is roughly 300-400 GB and contains ~1-1.3 billion records. I would like to have the primary…
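
For context, the usual building block for this kind of hourly load is a MERGE (upsert) of the new batch into the primary Delta table on its key; a minimal sketch with hypothetical paths and key column:

    from delta.tables import DeltaTable

    hourly = spark.read.parquet("/landing/hourly/")           # new hourly batch
    primary = DeltaTable.forPath(spark, "/mnt/delta/primary")

    (primary.alias("t")
     .merge(hourly.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())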
1 vote, 4 answers

What is the best way to clean up and recreate a Databricks Delta table?

I am trying to clean up and recreate a Databricks Delta table for integration tests. I want to run the tests on a DevOps agent, so I am using JDBC (Simba driver), but it says statement type "DELETE" is not supported. When I clean up the underlying DBFS…
Preeti Joshi • 841 • 1 • 13 • 20
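
For context, a full reset is usually done from a Spark session (notebook or job) rather than over JDBC, since that path does not accept every statement type; a minimal sketch with placeholder names (dbutils is the Databricks utility, source_df is whatever seeds the test data):

    # Drop the metastore entry, clear the underlying files, then recreate.
    spark.sql("DROP TABLE IF EXISTS test_db.events")
    dbutils.fs.rm("/mnt/delta/test/events", True)

    (source_df.write
     .format("delta")
     .mode("overwrite")
     .option("overwriteSchema", "true")
     .option("path", "/mnt/delta/test/events")
     .saveAsTable("test_db.events"))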
1 vote, 3 answers

Databricks Delta Update

How can we update multiple records in a table from another table using Databricks Delta? I want to achieve something like: update ExistingTable set IsQualified = updates.IsQualified from updates where ExistingTable.key = updates.key. It's failing…
Psingla • 21 • 1 • 5
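
For reference, Delta's UPDATE statement does not take a FROM clause; the multi-row update-from-another-table shape in the question is normally written as a MERGE. A sketch using the table and column names from the question, assuming both are registered tables:

    spark.sql("""
        MERGE INTO ExistingTable AS t
        USING updates AS u
        ON t.key = u.key
        WHEN MATCHED THEN UPDATE SET t.IsQualified = u.IsQualified
    """)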
0 votes, 0 answers

How to build delta-rs Python for HDFS support

I would like to build delta-rs v0.10.1 with HDFS support. For this purpose I have customized the Cargo.toml file and added the hdfs feature, then built and installed it in my environment using the make install command. features = ["azure", "gcs", "python", "datafusion",…
0 votes, 1 answer

How to extract value from a spark dataframe and add it to a second one as a column?

I have 2 large Spark dataframes, df1 and df2. df1 has a column named colName that has only one distinct value. I need to add this column to df2. I'm wondering what would be the most efficient way to do that? My idea is to use limit() or…
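
For context, when the column really has a single distinct value, the cheap route is to pull that one value to the driver and attach it to the other DataFrame as a literal, avoiding a join; a minimal sketch using the names from the question:

    from pyspark.sql import functions as F

    # Take the single distinct value of colName from df1 ...
    value = df1.select("colName").first()[0]

    # ... and add it to df2 as a constant column.
    df2_with_col = df2.withColumn("colName", F.lit(value))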
0 votes, 0 answers

How to partition a table with different ingestion sources and volumes?

I have a database implemented with Databricks Delta Lake and Azure Storage. I need to create a table that would have new data ingested on a daily basis. The data would come from different sources. Each day the volume can differ (it can be minor or…
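
For context, a common layout for this kind of table is to partition by ingestion date (and, if the source list is small and stable, by source as well), so each daily load lands in its own directory regardless of how large it is; a minimal sketch with placeholder names:

    # Append each daily batch into a Delta table partitioned by ingestion date
    # and source; column names and path are placeholders.
    (daily_df.write
     .format("delta")
     .mode("append")
     .partitionBy("ingest_date", "source")
     .save("/mnt/delta/ingest"))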