Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets (a PySpark sketch of these operations follows this list). This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
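For orientation, the core operations above look roughly like this in PySpark. This is a minimal sketch: the path /tmp/events, the column names, and the session configuration are placeholders, not part of the quoted feature list.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Delta Lake is enabled by registering its SQL extension and catalog.
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# ACID batch write with schema enforcement.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/events")

# Time travel: read an earlier snapshot of the same table by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")

# Updates and deletes through the DeltaTable API.
tbl = DeltaTable.forPath(spark, "/tmp/events")
tbl.delete("id = 2")
tbl.update(condition="id = 1", set={"value": "'updated'"})
```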
1226 questions
0
votes
0 answers

mismatched input 'NOT' expecting {<EOF>, ';'} using sql merge into

I am trying to merge a PySpark df (which I set up like this: self.df.createOrReplaceTempView("df")) and a delta table that is saved at path = '`/example/example/example/`'. The path is working and so is the TempView. I am getting the following…
Bondgirl
  • 107
  • 7
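For reference against the question above, a Delta SQL MERGE has roughly the shape below. This is a sketch with a placeholder join key (id); a parse error at NOT is often a sign that the Delta SQL extension is not enabled on the session, which is an assumption about this particular setup.

```python
# Sketch of Delta's SQL MERGE syntax against a path-based table; `id` is a
# placeholder join key, `df` is the registered temp view from the question.
spark.sql("""
    MERGE INTO delta.`/example/example/example/` AS target
    USING df AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```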
0
votes
0 answers

ERROR: org.apache.spark.sql.execution.datasources.FileFormatWriter$.write

I am running on the following config: Cluster type: E64_v3 (1 driver + 3 workers) other spark configs: spark.shuffle.io.connectionTimeout 1200s spark.databricks.io.cache.maxMetaDataCache 40g spark.rpc.askTimeout 1200s…
user3868051
  • 1,147
  • 2
  • 22
  • 43
0
votes
1 answer

Column separator mismatch when reading Parquet dataset into H2OFrame after conversion from Delta to Parquet

I am attempting to read a multi-file Parquet dataset into an H2OFrame and it results in a column mismatch error: H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException: Error: Column separator mismatch. One file seems to use…
James Adams
  • 8,448
  • 21
  • 89
  • 148
0
votes
1 answer

Issue while reading delta file placed under wasb storage from mapr cluster

I am trying to read a Delta format file from Azure storage using the code below in a Jupyter notebook running on a MapR cluster. When I run this code it throws java.lang.NoSuchMethodException:…
Shalaj
  • 579
  • 8
  • 19
0
votes
2 answers

How to create a blank "Delta" Lake table schema in Azure Data Lake Gen2 using Azure Synapse Serverless SQL Pool?

I have a file with data integrated from 2 different sources using Azure Mapping Data Flow and loaded into an ADLS2 data lake container/folder, for example /staging/EDW/Current/products.parquet. I now need to process this file in staging…
0
votes
1 answer

How to limit a table based on 2 categories from a particular column in SQL / Hive SQL

I have a table with a column called region, which contains 2 values - Mexico and USA. I want to download a sample subset of the table, but the sample needs to include rows for both USA and Mexico. I have tried these queries -…
Mohseen Mulla
  • 542
  • 7
  • 15
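One way to get a sample covering both regions in the question above is to cap each region separately and union the results. A hedged sketch, where the table name sales and the 500-row cap are made up for illustration:

```python
# Take up to 500 rows per region and stack them; the same SQL works in both
# Spark SQL and Hive SQL (table name and limits are placeholders).
sample = spark.sql("""
    SELECT * FROM (SELECT * FROM sales WHERE region = 'USA'    LIMIT 500) usa_rows
    UNION ALL
    SELECT * FROM (SELECT * FROM sales WHERE region = 'Mexico' LIMIT 500) mexico_rows
""")
```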
0
votes
1 answer

Combining delta io and excel reading

When using com.crealytics:spark-excel_2.12:0.14.0 without delta: spark = SparkSession.builder.appName("Word Count") .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.14.0") .getOrCreate() df =…
OMA
  • 543
  • 2
  • 6
  • 18
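A hedged sketch of loading both packages on one session for the question above: the delta-core coordinates (io.delta:delta-core_2.12:1.0.0) are an assumption and must match the Spark and Scala versions in use, and the file paths are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("excel-to-delta")
    # Both packages go into one comma-separated spark.jars.packages value.
    .config("spark.jars.packages",
            "com.crealytics:spark-excel_2.12:0.14.0,"
            "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the workbook with spark-excel, then persist it as a Delta table.
df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .load("/data/report.xlsx"))
df.write.format("delta").mode("overwrite").save("/data/report_delta")
```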
0
votes
1 answer

How to dynamically pass save_args to kedro catalog?

I'm trying to write delta tables in Kedro. Changing the file format to delta writes delta tables with mode set to overwrite. Previously, a node in the raw layer (meta_reload) creates a dataset that determines the start date for…
Sandeep Gunda
  • 171
  • 1
  • 10
0
votes
0 answers

Grouping streaming data in pyspark

I have two delta tables I'm reading, joining and writing. They both have timestamps, so I'm using those as watermarks and I can join the data without problems. However, when I try to group it, the stream doesn't write anything to the delta…
Stefan
  • 2,098
  • 2
  • 18
  • 29
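Relevant to the question above: in append mode a streaming aggregation only emits a group once the watermark passes the end of its time window, so nothing reaches the sink until then. A minimal sketch, where the paths, column names and intervals are placeholders:

```python
from pyspark.sql import functions as F

events = (spark.readStream.format("delta").load("/delta/events")
          .withWatermark("event_time", "10 minutes"))

# Group on a time window covered by the watermark so append mode can
# finalise and emit each window.
counts = (events
          .groupBy(F.window("event_time", "5 minutes"), "device_id")
          .count())

query = (counts.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/delta/_checkpoints/event_counts")
         .start("/delta/event_counts"))
```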
0
votes
1 answer

Databricks Serverless Compute - writeback to delta tables

Databricks Serverless Compute - I know this is still in preview, is by request, and is only available on AWS. Can this be used to read and write (update) .delta tables, or is it read-only? And is it good for running small queries (transactional in…
0
votes
1 answer

Using ALTER TABLE to add new column into Array(Struct) Column on Databricks

I have a DeltaTable with several columns that are ArrayTypes, containing StructTypes. I'm trying to add an extra column into the StructType, but I am running into issues because it is wrapped in an ArrayType. Hoping someone has a way to do this, or…
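If it helps with the question above, Delta's ALTER TABLE ADD COLUMNS can usually address a struct field inside an array through the element keyword; the table and field names below are placeholders and the exact syntax should be checked against the Delta/Databricks version in use.

```python
# Hypothetical table `orders` with an ArrayType(StructType) column `items`;
# this adds a new `discount` field to each struct element of the array.
spark.sql("ALTER TABLE orders ADD COLUMNS (items.element.discount DOUBLE)")
```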
0
votes
1 answer

Error while creating a table from another one in Apache Spark

I'm creating the table the following way: spark.sql("CREATE TABLE IF NOT EXISTS table USING DELTA AS SELECT * FROM origin") But I get this error: Exception in thread "main" org.apache.spark.sql.SparkException: Table implementation does not support…
NachoAG
  • 3
  • 6
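For comparison with the question above, a minimal Delta CTAS sketch: the "does not support writes" error is often a sign that the Delta catalog and SQL extension were not set on the session, so they are included here (an assumption about the setup; table names are placeholders).

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-ctas")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Create a managed Delta table from an existing table or view.
spark.sql("CREATE TABLE IF NOT EXISTS origin_copy USING DELTA AS SELECT * FROM origin")
```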
0
votes
1 answer

Spark Delta merge add Source column value to Target column value

I want the updated value in the target column to be the sum of the source value + target value. Example: %scala import org.apache.spark.sql.functions._ import io.delta.tables._ // Create example delta table val dept = Seq(("Finance",10),…
Gadam
  • 2,674
  • 8
  • 37
  • 56
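The question above is in Scala, but the idea carries over directly. A hedged PySpark sketch, where the path and the dept_name/amount columns follow the spirit of the example but are assumptions:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

target = DeltaTable.forPath(spark, "/delta/dept")
updates = spark.createDataFrame([("Finance", 5)], ["dept_name", "amount"])

# On a match, set the target amount to target + source; insert new rows as-is.
(target.alias("t")
 .merge(updates.alias("s"), "t.dept_name = s.dept_name")
 .whenMatchedUpdate(set={"amount": F.expr("t.amount + s.amount")})
 .whenNotMatchedInsertAll()
 .execute())
```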
0
votes
1 answer

Map write operation to groups of Dataframe rows to different delta tables

I have a Dataframe with rows which will be saved to different target tables. Right now, I'm finding the unique combination of parameters to determine the target table, iterating over the Dataframe and filtering, then writing. Something similar to…
TomNash
  • 3,147
  • 2
  • 21
  • 57
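The filter-and-write loop described above looks roughly like the sketch below; target_key and the path scheme are placeholders. If every group shares one schema, a single write with partitionBy over the key column avoids the loop entirely.

```python
from pyspark.sql import functions as F

# Collect the distinct routing keys, then filter and write each group to
# its own Delta table path (one write per key).
keys = [row["target_key"] for row in df.select("target_key").distinct().collect()]

for key in keys:
    (df.filter(F.col("target_key") == key)
       .write.format("delta")
       .mode("append")
       .save(f"/delta/tables/{key}"))
```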
0
votes
1 answer

Synapse Query from Delta Lake Databricks with SSMS

I would like to know if there is any way to query a Delta format table in a blob container that was created using Databricks, from SSMS or Azure Data Studio connected to Azure Synapse. I've tried to query with this query: SELECT TOP(10) *…
MADFROST
  • 1,043
  • 2
  • 11
  • 29