Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It also provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
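To make the features above concrete, here is a minimal PySpark sketch (not part of the tag wiki; the path, table contents, and column names are made up) showing an ACID write, a time-travel read, and a MERGE-based upsert. It assumes the Delta Lake package is already on the classpath.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/events"  # hypothetical table location

# Write a Delta table: Parquet data files plus a _delta_log transaction log.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
     .write.format("delta").mode("overwrite").save(path)

# Time travel: read an earlier snapshot by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Updates and deletes via MERGE (upsert).
updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "value"])
target = DeltaTable.forPath(spark, path)
(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```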
1226 questions
0
votes
1 answer

Execution error : field item_size: ArrayType(DoubleType,true) can not accept object array([19. , 5. , 6.5]) in type

When I am trying to create a DeltaTable using the delta.io library in AWS Glue, I get this error: Execution error : field item_size: ArrayType(DoubleType,true) can not accept object array([19. , 5. , 6.5]) in type Here some…
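Not an accepted answer, but this error usually appears when a numpy array (for example, coming out of pandas) is passed where Spark expects a plain Python list for an ArrayType column. A hedged sketch of the usual workaround; the column name item_size comes from the question, everything else is assumed:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

pdf = pd.DataFrame({"item_size": [np.array([19.0, 5.0, 6.5])]})

# Convert numpy arrays to plain Python lists before handing the data to Spark.
pdf["item_size"] = pdf["item_size"].apply(lambda a: [float(x) for x in a])

schema = StructType([StructField("item_size", ArrayType(DoubleType(), True))])
df = spark.createDataFrame(pdf, schema=schema)  # ArrayType column is now accepted
```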
0
votes
1 answer

Databricks truncate delta table restart identity 1

We created a SQL notebook in Databricks and are trying to develop a one-time script. We have to truncate and load the data every time, and the generated table sequence id should always start with 1 when we truncate and load the data. The sequence of id…
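One commonly suggested approach (an assumption, not a verified answer) is to recreate the table instead of truncating it, so the identity sequence starts again at 1. A Spark SQL sketch with hypothetical table and column names:

```python
# Hypothetical schema/table names; CREATE OR REPLACE resets the identity sequence.
spark.sql("""
    CREATE OR REPLACE TABLE my_schema.my_table (
        id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
        payload STRING
    ) USING DELTA
""")

# Reload the data; newly inserted rows get ids starting from 1 again.
spark.sql("""
    INSERT INTO my_schema.my_table (payload)
    SELECT payload FROM my_schema.staging   -- hypothetical source table
""")
```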
0
votes
0 answers

How to load the latest version of delta parquet using spark?

I have access to a repository where a team writes parquet files (without partitioning them) using delta (i.e. there is a delta log in this repository). I have no access to the table itself, though. To create a dataframe from those parquet files, I am using…
V.Leymarie
  • 2,708
  • 2
  • 11
  • 18
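Reading the path with the delta format (rather than parquet) resolves the _delta_log and returns the latest committed version by default. A short sketch; the path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta extensions assumed configured

path = "s3://bucket/team-output"  # hypothetical repository path

# Latest version of the table (stale/removed Parquet files are ignored).
df_latest = spark.read.format("delta").load(path)

# An earlier snapshot can still be pinned explicitly if needed.
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load(path)
```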
0
votes
0 answers

How to stream data from Delta Table to Kafka Topic

The internet is filled with examples of streaming data from a Kafka topic to Delta tables, but my requirement is to stream data from a Delta table to a Kafka topic. Is that possible? If yes, can you please share a code example? Here is the code I tried. val…
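It is possible: a Delta table can be used as a streaming source and the result written with the Kafka sink. A minimal PySpark sketch; the table path, brokers, topic, and checkpoint location are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.getOrCreate()  # Delta and Kafka packages assumed on the classpath

src = spark.readStream.format("delta").load("/delta/events")  # hypothetical Delta table path

query = (
    src.select(to_json(struct(*src.columns)).alias("value"))  # Kafka sink expects a 'value' column
       .writeStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")      # placeholder brokers
       .option("topic", "events")                             # placeholder topic
       .option("checkpointLocation", "/tmp/chk/delta-to-kafka")
       .start()
)
query.awaitTermination()
```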
0
votes
1 answer

Cannot read Delta tables created by Spark in Hive or Dbeaver/JDBC

I've used Spark 3.3.1, configured with delta-core_2.12-2.2.0 and delta-storage-2.2.0, to create several tables within an external database. spark.sql("create database if not exists {database}.{table} location {path_to_storage}") Within that…
Joe Ingle
  • 11
  • 2
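A sketch of how the database and table creation are usually split (names and paths are placeholders); CREATE DATABASE takes only a database name, and the table gets its own statement:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

spark.sql("CREATE DATABASE IF NOT EXISTS mydb LOCATION 's3a://bucket/warehouse/mydb'")
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.events (id BIGINT, value STRING)
    USING DELTA
    LOCATION 's3a://bucket/warehouse/mydb/events'
""")
```

Note that plain Hive or a generic JDBC client cannot interpret the Delta transaction log on its own; without a Delta-aware connector it only sees the raw Parquet files, including stale ones.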
0
votes
1 answer

spark-sql/spark-submit with delta lake results in a null pointer exception (at org.apache.spark.storage.BlockManagerMasterEndpoint)

I'm using Delta Lake with pyspark by submitting the below command: spark-sql --packages io.delta:delta-core_2.12:0.8.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf…
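For reference, a PySpark sketch of the Delta-enabled session setup that those command-line flags correspond to (the package version matches the question; everything else is a generic template, not a diagnosis of the NPE):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-app")
    # Same settings the --packages/--conf flags pass on the command line.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("SELECT 1").show()  # sanity check that the session starts cleanly
```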
0
votes
0 answers

Databricks Delta - Error: Overlapping auth mechanisms using deltaTable.detail()

In Azure Databricks, I have a Unity Catalog metastore created on ADLS in its own container (metastore@stgacct.dfs.core.windows.net/), connected with the Azure identity. It works fine. I have a container on the same storage account called data. I'm using…
ExoV1
  • 97
  • 1
  • 7
0
votes
1 answer

Reading a Delta Table with no Manifest File using Redshift

My goal is to read a Delta table on AWS S3 using Redshift. I've read through the Redshift Spectrum to Delta Lake integration guide and noticed that it mentions generating a manifest with Apache Spark using: GENERATE symlink_format_manifest FOR TABLE…
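For context, both the SQL command from the integration guide and its Python equivalent generate the manifest from Spark; the table path below is a placeholder:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

# SQL form from the integration guide (hypothetical S3 path).
spark.sql("GENERATE symlink_format_manifest FOR TABLE delta.`s3a://my-bucket/my-delta-table`")

# Equivalent Python API.
DeltaTable.forPath(spark, "s3a://my-bucket/my-delta-table").generate("symlink_format_manifest")
```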
0
votes
1 answer

Z order column in databricks table

I am working on creating a notebook that end users can run by providing the table name as input to get an efficient sample query (by utilising the partition key and Z-order column). I can get the partition column with describe table or…
Nikesh
  • 47
  • 6
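DESCRIBE TABLE does not expose Z-order columns, but OPTIMIZE operations record them in the table history under operationParameters. A hedged PySpark sketch; the table name is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

table_name = "my_db.my_table"  # hypothetical input from the notebook user

# OPTIMIZE ... ZORDER BY runs record the columns as a JSON string in operationParameters.
history = spark.sql(f"DESCRIBE HISTORY {table_name}")
zorder = (
    history.filter(F.col("operation") == "OPTIMIZE")
           .select(F.col("operationParameters")["zOrderBy"].alias("zOrderBy"))
           .filter(F.col("zOrderBy").isNotNull())
           .first()
)
print(zorder["zOrderBy"] if zorder else "no Z-order information found")
```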
0
votes
1 answer

AttributeError: 'DataFrameWriter' object has no attribute 'schema'

I would like to write a Spark DataFrame with a fixed schema. I am trying this: from pyspark.sql.types import StructType, StructField, IntegerType, DateType, DoubleType my_schema = StructType([ StructField("seg_gs_eur_am", DoubleType()), …
Enrique Benito Casado
  • 1,914
  • 1
  • 20
  • 40
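The schema is attached when a DataFrame is created or read, not on the writer. A hedged sketch; the column name comes from the question, the data and path are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

my_schema = StructType([StructField("seg_gs_eur_am", DoubleType())])

# Option 1: attach the schema when the DataFrame is created (or on spark.read).
df = spark.createDataFrame([(1.5,), (2.0,)], schema=my_schema)

# Option 2: cast an existing DataFrame's columns to match the target schema.
df = df.select(*(F.col(f.name).cast(f.dataType) for f in my_schema.fields))

# The writer has no .schema() method; just write the already-typed DataFrame.
df.write.format("delta").mode("overwrite").save("/tmp/my_table")  # hypothetical path
```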
0
votes
1 answer

case when in merge statement databricks

I am trying to upsert in Databricks using a merge statement in pyspark. I wanted to know if using expressions (e.g. adding two columns, case when) is allowed in the whenMatchedUpdate part. For example, I want to do something like this: deltaTableTarget =…
Abhishek
  • 83
  • 10
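Expressions are allowed: the values in the whenMatchedUpdate set map are SQL expression strings, so arithmetic and CASE WHEN both work. A hedged sketch with hypothetical table and column names:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

source_df = spark.createDataFrame(
    [(1, 10, 5), (2, -3, 7)], ["id", "amount_a", "amount_b"]  # hypothetical source
)

target = DeltaTable.forName(spark, "my_db.target_table")  # hypothetical target table

(target.alias("t")
       .merge(source_df.alias("s"), "t.id = s.id")
       # 'set' values are SQL expression strings, so expressions are fine here.
       .whenMatchedUpdate(set={
           "total": "s.amount_a + s.amount_b",
           "status": "CASE WHEN s.amount_a > 0 THEN 'active' ELSE t.status END",
       })
       .execute())
```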
0
votes
0 answers

Why are the log files of modern data storage formats like Delta and Apache Kafka topics left-padded with zeros up to 20 digits?

I am a data engineer who uses both Apache Kafka for streaming and Delta for storage on lakes. This is more of a question to feed my curiosity. I can see that both the Delta transaction log files (which have a .json extension) as well as the Kafka…
akhil pathirippilly
  • 920
  • 1
  • 7
  • 25
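Not an authoritative answer, but the padding keeps plain string sorting of file names consistent with numeric ordering of versions/offsets, and 20 digits is wide enough for any unsigned 64-bit value. A tiny Python illustration:

```python
# Without padding, lexicographic order breaks numeric order: 10 sorts before 2.
names = [f"{v}.json" for v in (2, 10, 1)]
print(sorted(names))              # ['1.json', '10.json', '2.json']

# Zero-padded names sort in true numeric order.
padded = [f"{v:020d}.json" for v in (2, 10, 1)]
print(sorted(padded))

# 20 digits covers any unsigned 64-bit value.
print(len(str(2**64 - 1)))        # 20
```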
0
votes
0 answers

Create folder wise structure in Delta Format on HDFS

I am consuming Kafka data that has an "eventtime" (datetime) field in the packet. I want to create HDFS directories in a "year/month/day" structure in streaming, based on the date part of the eventtime field. I am using delta-core_2.11:0.6.1, Spark 2.4…
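A hedged sketch of one way to do this: derive year/month/day columns from eventtime and let partitionBy create the folder structure. Brokers, topic, paths, and the JSON payload layout are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # Kafka and Delta packages assumed

# Hypothetical parsing: assume the Kafka value is JSON containing an 'eventtime' field.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
         .option("subscribe", "events")                      # placeholder topic
         .load()
         .selectExpr("CAST(value AS STRING) AS json")
         .withColumn("eventtime",
                     F.get_json_object("json", "$.eventtime").cast("timestamp"))
)

# partitionBy writes year=/month=/day= directories under the Delta table path on HDFS.
query = (
    events.withColumn("year", F.year("eventtime"))
          .withColumn("month", F.month("eventtime"))
          .withColumn("day", F.dayofmonth("eventtime"))
          .writeStream.format("delta")
          .partitionBy("year", "month", "day")
          .option("checkpointLocation", "hdfs:///checkpoints/events")
          .start("hdfs:///data/events")
)
query.awaitTermination()
```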
0
votes
0 answers

How to prevent memory leak (OOM) when running two streams in one spark app?

I have two streaming queries in one app: (1) landing to cleaning zone: move new data in the landing zone (raw data format) to the cleaning zone and save it in delta format; (2) read log data from Kafka, join it with the delta format table (cleaning zone), and save…
user3595632
  • 5,380
  • 10
  • 55
  • 111
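A structural sketch only, not a guaranteed OOM fix: two queries in one app, each with its own checkpoint and throttled input so micro-batches stay bounded. All formats, paths, and limits below are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stream 1: landing zone -> cleaning zone (source format assumed to be delta).
q1 = (
    spark.readStream.format("delta")
         .option("maxFilesPerTrigger", 10)                 # bound batch size
         .load("/landing/events")                          # hypothetical paths
         .writeStream.format("delta")
         .option("checkpointLocation", "/chk/cleaning")    # each query needs its own checkpoint
         .start("/cleaning/events")
)

# Stream 2: Kafka input, also throttled (join with the cleaned table elided here).
q2 = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "logs")
         .option("maxOffsetsPerTrigger", 10000)
         .load()
         .writeStream.format("delta")
         .option("checkpointLocation", "/chk/serving")
         .start("/serving/events")
)

spark.streams.awaitAnyTermination()
```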
0
votes
1 answer

How can I use delta cache function in local spark cluster mode?

I'd like to test delta-cache in local cluster mode (Jupyter). 1. What I want to do: whole delta-formatted files shouldn't be re-downloaded every time; only new data should be re-downloaded. 2. What I've tried ... #…
user3595632
  • 5,380
  • 10
  • 55
  • 111
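Worth noting: the disk-based "delta cache" (spark.databricks.io.cache.enabled) is a Databricks Runtime feature and is not available in plain local Spark. A hedged local stand-in is Spark's own persistence; the path below is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()  # Delta package assumed

df = spark.read.format("delta").load("/tmp/events")   # hypothetical table path
df.cache()        # or df.persist(StorageLevel.MEMORY_AND_DISK) for spill-to-disk behaviour
df.count()        # materialize the cache; subsequent actions reuse the cached blocks
```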