Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
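
For orientation, here is a minimal PySpark sketch of a few of the features listed above (ACID writes, time travel, and the merge API). The path, column names, and session configuration are illustrative assumptions; the two configs shown are only needed for Delta Lake 0.7+ on Spark 3, and delta-core is assumed to be on the classpath (e.g. via --packages).

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (SparkSession.builder
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # ACID write: the commit is atomic, so readers never see a partial table.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Time travel: read an earlier snapshot by version number.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Updates and deletes: upsert new rows with the merge API.
    target = DeltaTable.forPath(spark, "/tmp/delta/events")
    updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "value"])
    (target.alias("t")
           .merge(updates.alias("u"), "t.id = u.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())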
1226 questions
0
votes
1 answer

Azure Databricks - Unable to create symlink

I am trying to create a symlink on a Databricks Delta table which is on ADLS (Azure) with the below command: %sql GENERATE symlink_format_manifest FOR TABLE schema_name.`dbfs:/filepath`; which fails with the below error: Error in SQL statement:…
SunithaM
  • 61
  • 1
  • 7
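
For context on the command above: GENERATE takes either a registered table name or a delta.`<path>` reference, not a schema name combined with a backtick path. A hedged sketch (the table name and path below are placeholders, not taken from the question):

    # Run from PySpark; equivalent to the %sql cell in the question.
    spark.sql("GENERATE symlink_format_manifest FOR TABLE schema_name.table_name")
    # or address the table directly by its storage path:
    spark.sql("GENERATE symlink_format_manifest FOR TABLE delta.`/mnt/adls/path/to/table`")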
0
votes
0 answers

How to create a checkpoint for DeltaLake in Azure Databricks for past data with Pyspark?

I don't know how to create a checkpoint for DeltaLake in Azure Databricks for a past version. I tried, without success, to access the "DeltaLog" object to execute this: DeltaLog.forTable(spark, dataPath).checkpoint() I would like to create checkpoints…
Nastasia
  • 557
  • 3
  • 22
0
votes
1 answer

pyspark - micro-batch streaming a Delta table as a source to perform a merge against another Delta table - foreachBatch is not getting invoked

I have created a Delta table and now I'm trying to merge data into that table using foreachBatch(). I've followed this example. I am running this code on a Dataproc image 1.5.x cluster in Google Cloud, with Spark 2.4.7 and Delta 0.6.0. My code…
Rak
  • 196
  • 2
  • 9
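
The upsert-with-foreachBatch pattern the question refers to looks roughly like the sketch below; the paths, join key, and checkpoint location are assumptions, not taken from the question.

    from delta.tables import DeltaTable

    def upsert_to_delta(micro_batch_df, batch_id):
        # Merge each micro-batch into the target Delta table.
        target = DeltaTable.forPath(spark, "/delta/target")
        (target.alias("t")
               .merge(micro_batch_df.alias("s"), "t.id = s.id")
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())

    (spark.readStream.format("delta").load("/delta/source")
          .writeStream
          .foreachBatch(upsert_to_delta)
          .option("checkpointLocation", "/delta/_checkpoints/upsert")
          .outputMode("update")
          .start())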
0
votes
1 answer

Databricks Delta Lake Structured Streaming Performance with event hubs and ADLS g2

I'm currently attempting to process telemetry data, around 4 TB a day, using Delta Lake on Azure Databricks. I have a dedicated Event Hubs cluster to which the events are written, and I am attempting to ingest this event hub into…
0
votes
2 answers

MSCK REPAIR TABLE working strangely on delta tables

I have a delta table in s3 and for the same table, I have defined an external table in Athena. After creating the Athena table and generating manifests, I am loading the partitions using MSCK REPAIR TABLE. All the partition columns are in…
Ankit Anand
  • 321
  • 3
  • 7
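
For reference, the usual flow is to regenerate the manifests from Spark and then load partitions on the Athena side; whether MSCK REPAIR TABLE or explicit ALTER TABLE ... ADD PARTITION works better depends on the manifest layout. A hedged sketch with placeholder names:

    from delta.tables import DeltaTable

    # Regenerate the symlink manifests after the Delta table changes.
    dt = DeltaTable.forPath(spark, "s3a://my-bucket/warehouse/delta_table")
    dt.generate("symlink_format_manifest")

    # Then, in Athena (not Spark), load the partitions of the external table:
    #   MSCK REPAIR TABLE my_db.delta_table_manifest;
    #   -- or add them explicitly:
    #   ALTER TABLE my_db.delta_table_manifest ADD PARTITION (dt='2020-12-01')
    #     LOCATION 's3://my-bucket/warehouse/delta_table/_symlink_format_manifest/dt=2020-12-01/';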
0
votes
1 answer

Is it possible to connect to the Databricks SQL Analytics service from a Django app

My clients use Databricks for data engineering workloads and are interested in using Databricks SQL Analytics to service their BI requirements. I want to know if it is possible to connect to the Databricks SQL Analytics service from a Django app (since most…
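
One way to do this from Python (and therefore from a Django view or management command) is to point a SQL connector or ODBC driver at the SQL endpoint. A hedged sketch using the databricks-sql-connector package; the hostname, HTTP path, token, and table name are placeholders:

    from databricks import sql  # pip install databricks-sql-connector

    with sql.connect(server_hostname="adb-1234567890123456.7.azuredatabricks.net",
                     http_path="/sql/1.0/endpoints/0123456789abcdef",
                     access_token="dapiXXXXXXXXXXXX") as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM sales.orders LIMIT 10")
            rows = cursor.fetchall()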
0
votes
1 answer

Spark DataFrame is not saved in Delta format

I want to save a Spark DataFrame in Delta format to S3; however, for some reason, the data is not saved. I debugged all the processing steps, there was data, and right before saving it I ran count() on the DataFrame, which returned 24 rows. But as soon as…
Cassie
  • 2,941
  • 8
  • 44
  • 92
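
A hedged sanity-check sketch for this kind of issue (the bucket and prefix are placeholders): write the DataFrame explicitly, then read the same path back and count; if the read-back count is 0, look for a missing commit in the table's _delta_log directory or a lazily re-evaluated, now-empty source.

    path = "s3a://my-bucket/tables/events"

    # Explicit Delta write to S3.
    df.write.format("delta").mode("overwrite").save(path)

    # Read the committed table back and verify the row count.
    print(spark.read.format("delta").load(path).count())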
0
votes
1 answer

Delta Lake: don't we need a time partition for fully reprocessed tables anymore?

Objective: Suppose you're building a Data Lake and a Star Schema with the help of ETL. The storage format is Delta Lake. One of the ETL responsibilities is to build Slowly Changing Dimension (SCD) tables (cumulative state). This means that every day, for every…
VB_
  • 45,112
  • 42
  • 145
  • 293
0
votes
1 answer

Add Delta Lake packages to AWS EMR Notebook

The Delta jar delta-core_2.11-0.6.1.jar is added to the EMR master node's "SPARK_HOME/jars" directory. However, when calling the Delta API from an EMR Notebook I get the following error: # Though the Notebook comes with a default Spark instance, so the following line I didn't…
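
EMR Notebooks run through Sparkmagic/Livy, so the package usually has to be supplied to the notebook's own Spark session rather than to the master node's SPARK_HOME. A hedged sketch of a first notebook cell; the coordinates match the version named in the question, and the exact configuration should be treated as an assumption:

    %%configure -f
    { "conf": { "spark.jars.packages": "io.delta:delta-core_2.11:0.6.1",
                "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension" } }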
0
votes
0 answers

Azure Synapse vs Delta Lake

I would like to know if there are any decision-making factors between Azure Synapse and Delta Lake (Databricks) in regards to performance/features/price. Given that the original design relies heavily on stored procedures (Azure Synapse) to do…
mytabi
  • 639
  • 2
  • 12
  • 28
0
votes
2 answers

Does Databricks cluster need to be always up for VACUUM operation of Delta Lake?

I am using Azure Databricks with the latest runtime for the clusters. I had some confusion regarding the VACUUM operation in Delta Lake. We know we can set a retention duration on the deleted data; however, for the actual data to be deleted after the retention…
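
For reference, VACUUM only deletes files at the moment the command itself runs, so some cluster has to be up to execute it (a scheduled job cluster works; it does not have to be always-on). A hedged sketch with a placeholder path:

    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "/delta/events")
    dt.vacuum(168)  # physically remove unreferenced files older than 7 days (168 hours)

    # SQL equivalent:
    # spark.sql("VACUUM delta.`/delta/events` RETAIN 168 HOURS")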
0
votes
0 answers

Apache Spark - Delta Lake Structured Streaming: Empty Batch: 0 leads to Null pointer exception

I am trying to read an MQTT stream using Apache Spark and Apache Bahir for MQTT. I use a Delta table as the sink. However, when I start the program it always crashes with a NullPointerException. I noticed that when the program starts, the first…
0
votes
1 answer

Top 5 records using SQL query

I'm trying to retrieve the top 5 marks from a school database in a Databricks Delta table using a SQL query. So I wrote the following query: select rs.State, rs.Year, rs.CW, rs.Country, rs.SchoolName, rs.EducationSystem, rs.MarksS1, rs.MarksS2,…
newinPython
  • 313
  • 1
  • 6
  • 19
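
A hedged sketch of one way to do this from a Databricks notebook; the table and column names follow the excerpt above and are otherwise placeholders.

    spark.sql("""
        SELECT State, Year, Country, SchoolName, EducationSystem, MarksS1
        FROM school_results
        ORDER BY MarksS1 DESC
        LIMIT 5
    """).show()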
0
votes
2 answers

aggregation and week of the year in pyspark dataframe

I have the below schema in the dataframe: root |-- device_id: string (nullable = true) |-- eventName: string (nullable = true) |-- client_event_time: timestamp (nullable = true) |-- eventDate: date (nullable = true) |-- deviceType: string (nullable…
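
A hedged sketch of a week-of-year aggregation against that schema; the metric chosen (distinct devices per week) is an assumption.

    from pyspark.sql import functions as F

    (df.withColumn("week_of_year", F.weekofyear("eventDate"))
       .groupBy("deviceType", "week_of_year")
       .agg(F.countDistinct("device_id").alias("distinct_devices"))
       .orderBy("week_of_year")
       .show())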
0
votes
1 answer

Curation process with Delta Lake libraries (without Databricks)

I am using AWS DMS to pull data from Oracle. It lands in an S3 raw bucket. Using AWS Glue, I want to write PySpark code WITHOUT using the Databricks product to merge the CDC data with the initial load. What libraries would I need to import specifically in Spark…
PKad
  • 1
  • 1
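
Delta Lake outside Databricks only needs the open-source delta-core package on the Spark classpath; the rest is the same merge API. A hedged sketch for a Glue/EMR-style PySpark job; the bucket names, key column, and the DMS Op column convention are assumptions.

    # Submit with the open-source package, e.g.:
    #   spark-submit --packages io.delta:delta-core_2.11:0.6.1 cdc_merge.py
    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.appName("dms-cdc-merge").getOrCreate()

    # DMS CDC output from the raw bucket; DMS marks rows with Op = I/U/D.
    cdc = spark.read.parquet("s3a://raw-bucket/dms/orders/")
    target = DeltaTable.forPath(spark, "s3a://curated-bucket/delta/orders")

    (target.alias("t")
           .merge(cdc.alias("c"), "t.order_id = c.order_id")
           .whenMatchedDelete(condition="c.Op = 'D'")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())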