Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
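
For orientation, here is a minimal PySpark sketch of a few of these features, covering a Delta write, time travel, and a MERGE upsert. The session config follows the Delta docs; the path and column names are illustrative.

    # Minimal sketch; the path and column names are illustrative.
    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (SparkSession.builder.appName("delta-demo")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    path = "/tmp/delta/events"  # hypothetical location

    # ACID write in the open, Parquet-based format.
    spark.range(5).withColumnRenamed("id", "key").write.format("delta").mode("overwrite").save(path)

    # Time travel: read an earlier snapshot by version number.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Updates and deletes via the MERGE API.
    updates = spark.range(3, 8).withColumnRenamed("id", "key")
    (DeltaTable.forPath(spark, path).alias("t")
     .merge(updates.alias("u"), "t.key = u.key")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())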
1226 questions
0
votes
0 answers

Pyspark Delta OSS Merge into fails on not so large table

I have a table with 22M rows and I'm trying to update it via a combination of 3 keys with MERGE. Worth mentioning: every time it runs (on a daily basis) it executes a Z-Order optimization right after the merge. base_merge_into = ( table …
Benny Elgazar
  • 243
  • 2
  • 9
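
As a reference point for the pattern this question describes, a hedged sketch of a three-key MERGE followed by a Z-Order optimization; the table, key names, and the daily batch dataframe are hypothetical, and OPTIMIZE ... ZORDER BY requires Delta Lake OSS 2.0+ (or Databricks).

    # Hypothetical names; multi-key upsert followed by Z-Order optimization.
    from delta.tables import DeltaTable

    updates_df = spark.createDataFrame([(1, 2, 3, "new")], ["k1", "k2", "k3", "val"])  # stand-in for the daily batch

    target = DeltaTable.forName(spark, "db.target_table")
    (target.alias("t")
     .merge(updates_df.alias("s"),
            "t.k1 = s.k1 AND t.k2 = s.k2 AND t.k3 = s.k3")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # Rewrites data files so the three keys are co-located (Delta OSS 2.0+ / Databricks).
    spark.sql("OPTIMIZE db.target_table ZORDER BY (k1, k2, k3)")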
0
votes
0 answers

How to optimise the operations on pyspark dataframe?

I have a table which has 6070800 records after filtering on a particular sff_file_id. See the screenshot below for the schema. There are 299 measures in total, like measure1, measure2 ... measure299. I am reading it as a dataframe, then using the stack operation to get the…
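
For context, a hedged sketch of unpivoting wide measure columns with stack(); the dataframe and column names are taken loosely from the question and are otherwise illustrative.

    # Hedged sketch: turn measure1..measure299 into (measure_name, measure_value) rows.
    # 'df' stands in for the wide dataframe described in the question.
    measure_cols = [f"measure{i}" for i in range(1, 300)]
    stack_expr = "stack({n}, {pairs}) as (measure_name, measure_value)".format(
        n=len(measure_cols),
        pairs=", ".join(f"'{c}', {c}" for c in measure_cols))
    long_df = df.selectExpr("sff_file_id", stack_expr)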
0
votes
0 answers

Spark structured streaming app skips maxOffsetsPerTrigger option on first run

I have a Spark structured streaming application that reads data from one Kafka topic, does data validation, and writes to multiple Delta tables. After releasing a new application version and redeployment, the first trigger processed much more data…
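
For reference, a hedged sketch of where maxOffsetsPerTrigger sits in such a pipeline; the broker, topic, and paths are hypothetical.

    # Hedged sketch: Kafka source with a per-trigger cap, writing to a Delta sink.
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .option("maxOffsetsPerTrigger", 10000)  # limits offsets consumed per micro-batch
              .load())

    query = (stream.writeStream
             .format("delta")
             .option("checkpointLocation", "/tmp/checkpoints/events")
             .start("/tmp/delta/events"))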
0
votes
0 answers

Delta inconsistent concurrent merge result with Serializable Isolation level

I have 2 merges updating the same target table at a given instant and the table has the 'Serializable' isolation level. One of the merges succeeded and updated the records, whereas the second one also succeeded but did not update the records. I was…
Gagan
  • 1,775
  • 5
  • 31
  • 59
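
For reference, a hedged sketch of pinning a table's isolation level via the delta.isolationLevel table property (the table name is hypothetical); whether a concurrent MERGE is rejected or reconciled still depends on Delta's optimistic conflict detection.

    # Hedged sketch; 'Serializable' is the strictest setting, 'WriteSerializable' is the default.
    spark.sql("""
        ALTER TABLE db.target_table
        SET TBLPROPERTIES ('delta.isolationLevel' = 'Serializable')
    """)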
0
votes
0 answers

Spark.sql not working on EMR (Serverless)

The following script does not create the table in the S3 location indicated by the query. I tested it locally and the Delta Json file was created and contained the information about the table. from pyspark.sql import SparkSession spark =…
anselboero
  • 35
  • 1
  • 3
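
For comparison, a hedged sketch of the kind of statement involved, creating a Delta table at an explicit S3 location; the database, table, schema, and bucket are hypothetical.

    # Hedged sketch; on EMR the session also needs the Delta extensions/catalog configured.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS my_db.my_table (id BIGINT, name STRING)
        USING DELTA
        LOCATION 's3://my-bucket/path/my_table'
    """)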
0
votes
1 answer

how to add columns to a delta table using Python

I have a delta table # Load the data from its source. df = spark.read.load("/databricks-datasets/learning-spark-v2/people/people-10m.delta") # Write the data to a table. table_name = "people_10m" df.write.saveAsTable(table_name) I now have a…
Brian
  • 848
  • 10
  • 32
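
For reference, two hedged ways to add a column to a Delta table such as the people_10m table from the question; new_df and the column name are hypothetical.

    # Hedged sketch: explicit DDL on the existing Delta table.
    spark.sql("ALTER TABLE people_10m ADD COLUMNS (middle_name STRING)")

    # Or let an append evolve the schema automatically ('new_df' is a stand-in).
    (new_df.write.format("delta")
     .mode("append")
     .option("mergeSchema", "true")
     .saveAsTable("people_10m"))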
0
votes
1 answer

reading delta table specific file in folder

I am trying to read a specific file from a folder which contains multiple delta files, please refer to the attached screenshot. The reason is that I am looking to read the delta file based on the schema version. The folder mentioned above contains files with different…
SherKhan
  • 84
  • 1
  • 7
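
For reference, a hedged sketch of the usual way to pick a particular snapshot of a Delta folder, since individual Parquet files are addressed through the transaction log rather than read directly; the path, version, and timestamp are illustrative.

    # Hedged sketch: read a Delta table as of a version or a timestamp.
    df_v = spark.read.format("delta").option("versionAsOf", 3).load("/path/to/delta_table")
    df_t = (spark.read.format("delta")
            .option("timestampAsOf", "2023-01-01")
            .load("/path/to/delta_table"))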
0
votes
0 answers

MLflow: How to access a Databricks dataframe (or delta table) via API

I have a fairly simple use case in Databricks: I have a small dataframe which I need to access via API (a few API calls a minute). I am exploring various ways under Databricks MLflow, but I'm not sure what the simplest and easiest way to implement is, so the dataframe…
0
votes
0 answers

Cannot write PySpark DF with delta locally

I'm trying to write a table in delta format on my local machine with code that follows the Delta documentation. import pyspark from delta import * builder = pyspark.sql.SparkSession.builder.appName("MyApp") \ .config("spark.sql.extensions",…
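
For comparison, a hedged completion of that local setup following the Delta docs; the output path is illustrative.

    # Hedged sketch: build a local session with the Delta extensions, then write a table.
    import pyspark
    from delta import configure_spark_with_delta_pip

    builder = (pyspark.sql.SparkSession.builder.appName("MyApp")
               .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
               .config("spark.sql.catalog.spark_catalog",
                       "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    spark.range(10).write.format("delta").mode("overwrite").save("/tmp/local_delta_table")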
0
votes
1 answer

How to delete rows efficiently in sparksql?

I get a view with IDs for which I have to delete the corresponding records in a table present in a database.
View:
| columnName |
|------------|
|     1      |
|     2      |
|     3      |
|     4      |
Variables: tableName = name of the table, columnName =…
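
One hedged approach when the target is a Delta table is a MERGE that deletes on match; ids_df (the view's rows as a dataframe), tableName, and columnName are taken from the question and are otherwise hypothetical.

    # Hedged sketch: delete every target row whose key appears in the view.
    from delta.tables import DeltaTable

    # 'ids_df', 'tableName', and 'columnName' stand in for the question's variables.
    target = DeltaTable.forName(spark, tableName)
    (target.alias("t")
     .merge(ids_df.alias("v"), f"t.{columnName} = v.{columnName}")
     .whenMatchedDelete()
     .execute())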
0
votes
0 answers

Facing issue in AWS Glue while trying Delta operations

I am attempting to use update/delete/upsert operations in PySpark with AWS Glue. I have instantiated Spark with the configs below: spark = SparkSession.builder.config("spark.sql.extensions",…
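
For reference, a hedged sketch of the session configuration typically used for Delta in such jobs; the exact Glue/Delta version pairing and jar availability vary, so treat this as an assumption.

    # Hedged sketch: Delta SQL extensions and catalog on the Spark session.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())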
0
votes
1 answer

how to synchronize an external database on Spark session

I have a Delta Lake on an s3 Bucket. Since I would like to use Spark's SQL API, I need to synchronize the Delta Lake with the local Spark session. Is there a quick way to have all the tables available, without having to create a temporary view for…
anselboero
  • 35
  • 1
  • 3
0
votes
1 answer

Disk cache to implement the pattern for common and standard user queries

I wanted to know: if I explicitly cache a query as below, CACHE SELECT * FROM boxes, and later run another query like SELECT C1 FROM boxes, will this query be able to use the same cache? Or do we need to have the same query construct to use the disk…
Rajib Deb
  • 1,496
  • 11
  • 30
0
votes
0 answers

Are Isolation Levels of Delta Tables Enforced?

Here it is described that Delta Lake uses optimistic concurrency control by reading the current state, writing all changes, and validating whether there are conflicts (which might end up throwing an exception). Here the isolation levels are described…
user3579222
  • 1,103
  • 11
  • 28
0
votes
1 answer

Exception occurred while writing delta format in AWS S3

I am using Spark 3.x, Java 8 and Delta 1.0.0, i.e. delta-core_2.12_1.0.0, in my Spark job. Data is persisted in an AWS S3 path in "delta" format of parquet. Below are the details of the JARs I am using in my Spark job. spark-submit.sh export…
Shasu
  • 458
  • 5
  • 22
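
For reference, a hedged sketch of pulling the matching Delta artifact at session start instead of managing jars by hand; the coordinates come from the question, the bucket is illustrative, and S3 access additionally needs the hadoop-aws/S3A setup.

    # Hedged sketch: resolve delta-core at startup and write to S3 via s3a.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
             .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
             .config("spark.sql.catalog.spark_catalog",
                     "org.apache.spark.sql.delta.catalog.DeltaCatalog")
             .getOrCreate())

    # 'df' stands in for the dataframe being persisted in the question.
    df.write.format("delta").mode("overwrite").save("s3a://my-bucket/path/table")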