Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides ACID transactions, scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs (a short usage sketch follows the feature list below).

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
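A minimal PySpark sketch exercising several of these features, assuming the delta-spark package is installed and using a hypothetical local table path:

    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    # Configure a Spark session with the Delta Lake extensions.
    builder = (
        SparkSession.builder.appName("delta-features-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "/tmp/delta/events"  # hypothetical table location

    # ACID write: the commit either fully succeeds or is never visible.
    spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

    # Updates and deletes via the DeltaTable API.
    table = DeltaTable.forPath(spark, path)
    table.delete("id = 4")

    # Audit history: every change is recorded in the transaction log.
    table.history().select("version", "operation").show()
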
1226 questions
0 votes, 0 answers

Azure Synapse serverless doesn't read delta

I have around 90 delta views on Synapse serverless; 90% of them work flawlessly, but some do not. Databricks and Hive show all results correctly, but on Synapse I get an error message and no rows when I try to read the delta. If I write the same view…
Alvaro • 79 • 1 • 6
0 votes, 1 answer

Delta Lake Data Load Datatype mismatch

I am loading data from SQL Server to Delta Lake tables. Recently I had to repoint the source to another table (same columns), but the data type is different in the new table. This is causing an error while loading data to the delta table. Getting the following…
Vaishak • 607 • 3 • 8 • 30
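A common way to handle this kind of mismatch is to cast the incoming columns to the Delta table's existing schema before the load; a minimal sketch, assuming an existing SparkSession spark, a source DataFrame source_df, and a hypothetical table path:

    from pyspark.sql.functions import col

    target_path = "/mnt/delta/my_table"  # hypothetical Delta table location

    # Cast each incoming column to the type already stored in the target,
    # so the append no longer fails on a datatype mismatch.
    target_schema = spark.read.format("delta").load(target_path).schema
    aligned_df = source_df.select(
        [col(f.name).cast(f.dataType) for f in target_schema.fields]
    )
    aligned_df.write.format("delta").mode("append").save(target_path)
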
0 votes, 0 answers

How to update a delta table with the missing row using PySpark?

I need to update a delta table based on the rows of an update delta table. Update table (source_df): |ID| …
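An upsert of this shape is usually expressed with the DeltaTable merge API; a minimal sketch, assuming a SparkSession spark, an update DataFrame source_df with an ID column, and a hypothetical target path:

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/tmp/delta/target")  # hypothetical path

    # Update matching rows and insert the missing ones in a single atomic merge.
    (
        target.alias("t")
        .merge(source_df.alias("s"), "t.ID = s.ID")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
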
0 votes, 0 answers

Creating AWS Lake Formation Blueprint resource using terraform

I have a requirement to create a "Blueprint" under AWS Lake Formation using Terraform. I can't seem to find anything in the official Terraform docs to support this. Ideally I would need a separate resource to create the entire blueprint. Something…
0 votes, 0 answers

delta lake merge missing records

I am executing a Delta Lake function on AWS. However, I am not getting the correct result. Below is the PySpark script. It ran successfully; however, the output contains fewer records than the original table. …
0 votes, 1 answer

How to resolve `Value for one of the query parameters specified in the request URI is invalid` error?

I am trying to create a parquet file in an ADLS Gen2 container but it is failing with the below error: Status code 400, "{"error":{"code":"InvalidQueryParameterValue","message":"Value for one of the query parameters specified in the request URI is…
0 votes, 0 answers

How to read delta tables 2.1.0 in an S3 bucket that contains symlink_format_manifest using AWS Glue Studio 4.0?

I am using Glue Studio 4.0 to choose a data source (a delta table 2.1.0 saved in S3), as in the image below. I then generate the script from the box: import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions …
0 votes, 0 answers

Apache Spark with Delta Lake: DataFrame.show() is not responding

I created a Spark cluster environment and I am facing a freezing issue when I try to show a dataframe: builder = SparkSession.builder \ .appName('Contabilidade > Conta Cosif') \ .config("spark.jars",…
0 votes, 0 answers

ModuleNotFoundError: No module named 'delta.tables'; 'delta' is not a package

I'm trying to get my PySpark to work with a Delta table. I did "pip install delta" as well as "pip install delta-spark". This is my delta.py script: from delta.tables import * from pyspark.sql.functions import * deltaTable = DeltaTable.forPath(spark,…
Eugene Goldberg • 14,286 • 20 • 94 • 167
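A likely cause of "'delta' is not a package" here is the script's own name: a file called delta.py shadows the installed delta-spark package on the import path. A quick diagnostic sketch:

    # Check which 'delta' is actually being imported.
    import delta
    print(delta.__file__)  # should point into site-packages/delta, not ./delta.py

    # If it points at your own script, rename the script (e.g. my_delta_job.py);
    # 'from delta.tables import *' then resolves to the installed package.
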
0 votes, 1 answer

How to fix org.apache.spark.sql.internal.SQLConf$.PARQUET_FIELD_ID_READ_ENABLED() when running Spark with Delta Lake?

I am following the tutorial here on how to access a Delta lakehouse with Spark, but can't seem to get it to work. I have the following dependencies:
Finlay Weber • 2,989 • 3 • 17 • 37
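Errors about missing SQLConf members like this usually indicate a Delta/Spark version mismatch, since each delta-core release is built against a specific Spark minor version. A hedged sketch of pinning a matching pair (versions here are illustrative; check the Delta Lake compatibility matrix for the correct pairing):

    from pyspark.sql import SparkSession

    # Illustrative pairing only: delta-core 2.2.x targets Spark 3.3.x.
    spark = (
        SparkSession.builder.appName("delta-version-check")
        .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )
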
0 votes, 1 answer

Synapse serverless pool to query delta table previous versions

Can we use the Synapse serverless pool (Built-in) to query a delta file's previous version? I am keen on a SQL statement similar to what we do in Databricks: select * from delta.`/my_dir` version as of 2 Does OPENROWSET support a "version…
QPeiran • 1,108 • 1 • 8 • 18
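For comparison, Spark itself exposes the same time travel through the DataFrame reader; a minimal sketch, assuming a session spark with Delta configured and the /my_dir path from the question:

    # Mirrors: SELECT * FROM delta.`/my_dir` VERSION AS OF 2
    df_v2 = (
        spark.read.format("delta")
        .option("versionAsOf", 2)
        .load("/my_dir")
    )

    # Time travel by timestamp is also supported.
    df_then = (
        spark.read.format("delta")
        .option("timestampAsOf", "2023-01-01")
        .load("/my_dir")
    )
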
0 votes, 1 answer

Schema mismatch on insert using Delta (spark)

I've started playing around with Delta on EMR 6.9 and I'm attempting to perform a few basic operations to assess suitability. When I use Spark SQL to create a table and then insert data, I'm given an error: An error was encountered: A schema mismatch…
Ed Baker • 643 • 1 • 6 • 16
0 votes, 1 answer

Azure Data Factory DataFlow Error: Key partitioning does not allow computed columns

We have a generic dataflow that works for many tables; the schema is detected at runtime. We are trying to add a partition column for the ingestion, or sink, portion of the delta. We are getting the error: Azure Data Factory DataFlow Error: Key…
GVFLUSA • 25 • 4
0 votes, 2 answers

Glue not able to recognize Delta Lake Python Library

I am trying to use the Delta Lake Python library in my Glue job. However, my Glue job is not able to recognize it and I get the error "NameError: name 'DeltaTable' is not defined". Per the Glue-DeltaLake documentation, I added the parameter…
Jatin • 75 • 8
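In Glue 4.0 the usual prerequisite is the --datalake-formats job parameter set to delta, with the Delta SQL extension enabled via --conf (per the Glue-DeltaLake documentation the question cites). With that in place, a minimal job sketch using a hypothetical bucket:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Assumes --datalake-formats is set to "delta" on the job.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Hypothetical bucket/prefix: read the Delta table directly from S3.
    df = spark.read.format("delta").load("s3://my-bucket/path/to/delta-table/")
    df.show()

    job.commit()
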
0 votes, 1 answer

converting a DateTime column to string in ADF

I am trying to build a fully parametrised pipeline template in ADF. With the work I have done so far, I can do a full load without any issues, but when it comes to delta load, it seems my queries are not working. I believe the reason for this is…
newbie • 53 • 10