Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets (a short sketch of these operations follows this list). This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
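
A minimal PySpark sketch of the features above, assuming an existing SparkSession (spark) already configured with the Delta Lake package; the path, columns, and values are illustrative:

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    path = "/tmp/delta/events"  # hypothetical table location

    # ACID write: create a Delta table from a DataFrame
    spark.range(0, 5).withColumn("status", F.lit("new")) \
        .write.format("delta").mode("overwrite").save(path)

    # Time travel: read an earlier snapshot by version number
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # Updates and deletes: merge a batch of changes into the table
    changes = spark.range(3, 8).withColumn("status", F.lit("updated"))
    target = DeltaTable.forPath(spark, path)
    (target.alias("t")
        .merge(changes.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Audit history: inspect the transaction log
    target.history().select("version", "operation", "timestamp").show()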
1226 questions
0
votes
1 answer

Data Architecture - Full Azure Stack vs Integrated Delta Lake

A friend's company is working on a data architecture which, to us, seems rather convoluted and has several scalability and cost problems. If possible, I would like your opinion on the old and proposed architectures (or…
HGSF
  • 57
  • 5
0
votes
2 answers

Databricks overwriting entire table instead of adding a new partition

I have this table CREATE TABLE `db`.`customer_history` ( `name` STRING, `addrress` STRING, `filename` STRING, `dt` DATE) USING delta PARTITIONED BY (dt) When I use this to load partition data into the table df .write …
Srinivas
  • 2,010
  • 7
  • 26
  • 51
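
For the partition question above, a hedged sketch: rather than a plain overwrite (which replaces the whole table), a replaceWhere predicate replaces only the targeted dt partition, and newer Delta Lake releases also support dynamic partition overwrite; the date value is illustrative:

    # Replace only one date partition of db.customer_history (the value is an example)
    (df.write
       .format("delta")
       .mode("overwrite")
       .option("replaceWhere", "dt = '2021-06-01'")
       .saveAsTable("db.customer_history"))

    # On newer Delta Lake / Databricks runtimes, dynamic partition overwrite
    # replaces only the partitions present in df:
    (df.write
       .format("delta")
       .mode("overwrite")
       .option("partitionOverwriteMode", "dynamic")
       .saveAsTable("db.customer_history"))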
0
votes
1 answer

Why can't I query a Synapse serverless view from Azure Data Studio or Databricks?

When I query my delta table from Synapse Studio, I can see the data all good. But when I connect through Data Studio via SQL login (or Databricks), it seems that I cannot query it: CREATE or alter view stock as SELECT * FROM OPENROWSET( …
0
votes
1 answer

How to delete the data from the Delta table?

I was actually trying to delete the data from the Delta table. When I run the below query, I get around 500 or 1000 records. SELECT * FROM table1 inv join (SELECT col1, col2, col2, min(Date) minDate, max(Date) maxDate FROM table2 a GROUP…
Tony
  • 301
  • 3
  • 10
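
For the delete question above, a rough sketch under assumed column names: Delta SQL supports a plain DELETE for predicate-based deletes, and a MERGE with a matched-delete clause for join-based deletes:

    # Predicate delete (Delta SQL supports DELETE FROM directly); the date is an example
    spark.sql("DELETE FROM table1 WHERE Date < '2020-01-01'")

    # Join-based delete: remove rows of table1 that match rows coming from table2
    spark.sql("""
        MERGE INTO table1 AS inv
        USING (SELECT DISTINCT col1, col2 FROM table2) AS a
          ON inv.col1 = a.col1 AND inv.col2 = a.col2
        WHEN MATCHED THEN DELETE
    """)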
0
votes
0 answers

Store latest state of data on S3 using Spark

I am working on a Spark-EMR job. My requirement is to read data from S3 every hour, apply the same flattening transformation, and save the latest state of the data based on machine-id. I can get the same machine-id data in the next hour also, so I need to…
lucy
  • 4,136
  • 5
  • 30
  • 47
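
For the "latest state" question above, one common pattern is to keep a Delta table keyed by machine-id and MERGE each hourly batch into it; a sketch with assumed paths, formats, and column names:

    from delta.tables import DeltaTable

    # Hourly batch (source format and path are assumptions)
    hourly = spark.read.json("s3://bucket/raw/2021/06/01/10/")

    # Delta table that always holds the latest row per machine_id
    state = DeltaTable.forPath(spark, "s3://bucket/machine_state")

    (state.alias("t")
        .merge(hourly.alias("s"), "t.machine_id = s.machine_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())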
0
votes
1 answer

Databricks - Save partitioned CSV files into respective tables

I would like to share my requirement and how best it can be solved. I have an SQL query, say, "SQL_QUERY_RUNS_AND_GIVES_RESULT_SET", which runs and passes the result set to a dataframe. Since the result set is huge, I create several partitions out of it and…
0
votes
0 answers

Returning Delta Results from Serverless SQL Select Statements returns Invalid Object error

I am working through the samples in this document from Microsoft: Linux Foundation Delta Lake overview. The sections on table creation via Notebooks work, in that I have tables created in the default database as expected, in accordance with the…
Steve-at-sword
  • 63
  • 1
  • 11
0
votes
1 answer

Is it possible that some parquet files were moved or deleted from ADLS Gen2 without this being reflected in the logs?

I am coming here as I'm encountering some strange issues with blob storage (and delta tables, but I believe the issues come from blob storage). We are encountering the classic delta lake error that occurs when somebody manually deletes some files from the storage…
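
When files referenced by the Delta transaction log are deleted directly in storage, reads fail with exactly this kind of error; on Databricks, FSCK REPAIR TABLE can drop the dangling file entries from the log (a sketch with a placeholder table name):

    # DRY RUN only lists the log entries whose files no longer exist in storage
    spark.sql("FSCK REPAIR TABLE my_db.my_table DRY RUN").show(truncate=False)

    # Without DRY RUN, the missing-file entries are removed from the Delta log
    spark.sql("FSCK REPAIR TABLE my_db.my_table")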
0
votes
1 answer

How do I get the jar file for Delta Lake 1.0.0 Library

I use Delta Lake for doing upserts to my data in my Glue jobs. I usually put the jar file in S3 and use that location in the Glue job. I currently use Delta Lake 0.6.1, for which I got the jar file from somewhere I don't remember now. The problem is it…
Harish J
  • 146
  • 1
  • 3
  • 12
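
For the jar question above: the Delta Lake 1.0.0 jar is published to Maven Central under the coordinates io.delta:delta-core_2.12:1.0.0 (it targets Spark 3.1.x), so it can either be downloaded from there and uploaded to S3, or pulled by coordinates; a sketch of a plain Spark session configured that way (Glue-specific wiring not shown):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("delta-1.0.0-example")
        # Pull the Delta Lake 1.0.0 jar from Maven Central by coordinates
        .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
        # Session extensions required by Delta Lake 1.x
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate())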
0
votes
1 answer

How to query latest version of Delta Lake table in Azure Synapse?

How can we use an Azure Synapse serverless SQL pool to query the latest version of a Delta Lake table? The link below specifies it can be done under Delta Lake, but I am unable to find any…
ManiK
  • 377
  • 1
  • 21
0
votes
1 answer

How to pass parameters for parameterized linked services for inline datasets (delta) to dataflow?

I have a delta data source in dataflow. In order to connect to it, I need to use a parameterized linked service; however, I cannot find where I can address the values for the linked service parameters: The parameters are highlighted in the…
ARCrow
  • 1,360
  • 1
  • 10
  • 26
0
votes
1 answer

pyspark foreachBatch reading same data again after restarting the stream

df = spark.readStream.option("readChangeFeed", "true").option("startingVersion", 2).load(tablePath)

def foreach_batch_function(df, epoch_id):
    print("epoch_id: ", epoch_id) …
Akhilesh Jaiswal
  • 227
  • 2
  • 14
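
A restarted stream re-reads from startingVersion unless it has a checkpoint to resume from; a sketch that keeps the question's change-feed reader and adds a checkpointLocation (the path is illustrative):

    df = (spark.readStream
          .format("delta")
          .option("readChangeFeed", "true")
          .option("startingVersion", 2)
          .load(tablePath))

    def foreach_batch_function(batch_df, epoch_id):
        print("epoch_id:", epoch_id)
        # process batch_df here

    (df.writeStream
       .foreachBatch(foreach_batch_function)
       .option("checkpointLocation", "/tmp/checkpoints/cdf_reader")  # assumed path
       .start())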
0
votes
0 answers

How to connect to a Databricks table from C# code using JDBC/ODBC?

I have created a C# Windows service and want to connect to a Databricks Delta table from C# code using a JDBC/ODBC connection with the below host, for updating/inserting into a couple of Databricks Delta tables. I tried using the Simba ODBC connection but got the…
V J
  • 77
  • 12
0
votes
1 answer

Databricks Delta table Alter column for decimal(10,0) to decimal(38,18) conversion not working

In Databricks, the table is created using a schema JSON definition. The schema JSON used to create the table: { "fields": [ { "metadata": {}, "name": "username", "nullable": true, "type": "string" }, { …
Tim
  • 1,321
  • 1
  • 22
  • 47
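
Delta's ALTER TABLE ... ALTER COLUMN does not change a column's data type (including decimal precision), so the usual workaround is to cast and rewrite the table with schema overwrite enabled; a sketch with assumed table and column names:

    from pyspark.sql import functions as F

    df = spark.table("my_db.my_table")
    df = df.withColumn("amount", F.col("amount").cast("decimal(38,18)"))  # assumed column

    (df.write
       .format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")  # allow the widened decimal type in the table schema
       .saveAsTable("my_db.my_table"))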
0
votes
1 answer

Is there a way to tell before the write how many files will be created when saving Spark Dataframe as Delta Table in Azure Data Lake Storage Gen1?

I am currently trying to save a Spark Dataframe to Azure Data Lake Storage (ADLS) Gen1. While doing so I receive the following throttling error: org.apache.spark.SparkException: Job aborted. Caused by:…
DataBach
  • 1,330
  • 2
  • 16
  • 31
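
For the file-count question above, a rough rule is that each DataFrame partition writes one file per table partition it touches, so the count can be estimated and bounded before the write; a sketch (the target count and ADLS Gen1 path are placeholders):

    # Rough estimate of how many files an unpartitioned Delta write will produce
    print("dataframe partitions:", df.rdd.getNumPartitions())

    # Bound the number of output files explicitly before writing
    (df.repartition(64)   # placeholder target file count
       .write
       .format("delta")
       .mode("overwrite")
       .save("adl://myadls.azuredatalakestore.net/delta/my_table"))  # illustrative ADLS Gen1 path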