Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It also provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
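To make the features above concrete, here is a minimal PySpark sketch (not part of the tag wiki; the path, table contents, and column names are made up) showing an ACID write, a time-travel read, and a MERGE-based upsert. It assumes the Delta Lake package is already on the classpath.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/events"  # hypothetical table location

# Write a Delta table: Parquet data files plus a _delta_log transaction log.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
     .write.format("delta").mode("overwrite").save(path)

# Time travel: read an earlier snapshot by version number.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# Updates and deletes via MERGE (upsert).
updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "value"])
target = DeltaTable.forPath(spark, path)
(target.alias("t")
       .merge(updates.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```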
1226 questions
0
votes
1 answer

Execution error : field item_size: ArrayType(DoubleType,true) can not accept object array([19. , 5. , 6.5]) in type

When I am trying to create a DeltaTable using the delta.io library in AWS Glue, I get this error: Execution error : field item_size: ArrayType(DoubleType,true) can not accept object array([19. , 5. , 6.5]) in type Here some…
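Not an accepted answer, but this error usually appears when a numpy array (for example, coming out of pandas) is passed where Spark expects a plain Python list for an ArrayType column. A hedged sketch of the usual workaround; the column name item_size comes from the question, everything else is assumed:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

pdf = pd.DataFrame({"item_size": [np.array([19.0, 5.0, 6.5])]})

# Convert numpy arrays to plain Python lists before handing the data to Spark.
pdf["item_size"] = pdf["item_size"].apply(lambda a: [float(x) for x in a])

schema = StructType([StructField("item_size", ArrayType(DoubleType(), True))])
df = spark.createDataFrame(pdf, schema=schema)  # ArrayType column is now accepted
```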
0
votes
1 answer

Databricks truncate delta table restart identity 1

We created a SQL notebook in Databricks and are trying to develop a one-time script. We have to truncate and load the data every time, and the generated table sequence id should always start with 1 when we truncate and load the data. The sequence of id…
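One commonly suggested approach (an assumption, not a verified answer) is to recreate the table instead of truncating it, so the identity sequence starts again at 1. A Spark SQL sketch with hypothetical table and column names:

```python
# Hypothetical schema/table names; CREATE OR REPLACE resets the identity sequence.
spark.sql("""
    CREATE OR REPLACE TABLE my_schema.my_table (
        id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1),
        payload STRING
    ) USING DELTA
""")

# Reload the data; newly inserted rows get ids starting from 1 again.
spark.sql("""
    INSERT INTO my_schema.my_table (payload)
    SELECT payload FROM my_schema.staging   -- hypothetical source table
""")
```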
0
votes
0 answers

How to load the latest version of delta parquet using spark?

I have access to a repository where a team writes parquet files (without partitioning them) using delta (i.e. there is a delta log in this repository). I have no access to the table itself, though. To create a dataframe from those parquet files, I am using…
V.Leymarie
  • 2,708
  • 2
  • 11
  • 18
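Reading the path with the delta format (rather than parquet) resolves the _delta_log and returns the latest committed version by default. A short sketch; the path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta extensions assumed configured

path = "s3://bucket/team-output"  # hypothetical repository path

# Latest version of the table (stale/removed Parquet files are ignored).
df_latest = spark.read.format("delta").load(path)

# An earlier snapshot can still be pinned explicitly if needed.
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load(path)
```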
0
votes
0 answers

How to stream data from Delta Table to Kafka Topic

The internet is filled with examples of streaming data from a Kafka topic to Delta tables, but my requirement is to stream data from a Delta table to a Kafka topic. Is that possible? If yes, can you please share a code example? Here is the code I tried. val…
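It is possible: a Delta table can be used as a streaming source and the result written with the Kafka sink. A minimal PySpark sketch; the table path, brokers, topic, and checkpoint location are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.getOrCreate()  # Delta and Kafka packages assumed on the classpath

src = spark.readStream.format("delta").load("/delta/events")  # hypothetical Delta table path

query = (
    src.select(to_json(struct(*src.columns)).alias("value"))  # Kafka sink expects a 'value' column
       .writeStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")      # placeholder brokers
       .option("topic", "events")                             # placeholder topic
       .option("checkpointLocation", "/tmp/chk/delta-to-kafka")
       .start()
)
query.awaitTermination()
```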
0
votes
1 answer

Cannot read Delta tables created by Spark in Hive or Dbeaver/JDBC

I've used Spark 3.3.1, configured with delta-core_2.12-2.2.0 and delta-storage-2.2.0, to create several tables within an external database. spark.sql("create database if not exists {database}.{table} location {path_to_storage}") Within that…
Joe Ingle
  • 11
  • 2
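A sketch of how the database and table creation are usually split (names and paths are placeholders); CREATE DATABASE takes only a database name, and the table gets its own statement:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

spark.sql("CREATE DATABASE IF NOT EXISTS mydb LOCATION 's3a://bucket/warehouse/mydb'")
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.events (id BIGINT, value STRING)
    USING DELTA
    LOCATION 's3a://bucket/warehouse/mydb/events'
""")
```

Note that plain Hive or a generic JDBC client cannot interpret the Delta transaction log on its own; without a Delta-aware connector it only sees the raw Parquet files, including stale ones.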
0
votes
1 answer

spark-sql/spark-submit with delta lake results in a null pointer exception (at org.apache.spark.storage.BlockManagerMasterEndpoint)

I'm using Delta Lake with pyspark by submitting the below command: spark-sql --packages io.delta:delta-core_2.12:0.8.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf…
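For reference, a PySpark sketch of the Delta-enabled session setup that those command-line flags correspond to (the package version matches the question; everything else is a generic template, not a diagnosis of the NPE):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-app")
    # Same settings the --packages/--conf flags pass on the command line.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("SELECT 1").show()  # sanity check that the session starts cleanly
```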
0
votes
0 answers

Databricks Delta - Error: Overlapping auth mechanisms using deltaTable.detail()

In Azure Databricks, I have a Unity Catalog metastore created on ADLS in its own container (metastore@stgacct.dfs.core.windows.net/), connected with the Azure identity. It works fine. I have a container on the same storage account called data. I'm using…
ExoV1
  • 97
  • 1
  • 7
0
votes
1 answer

Reading a Delta Table with no Manifest File using Redshift

My goal is to read a Delta table on AWS S3 using Redshift. I've read through the Redshift Spectrum to Delta Lake integration guide and noticed that it mentions generating a manifest with Apache Spark using: GENERATE symlink_format_manifest FOR TABLE…
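For context, both the SQL command from the integration guide and its Python equivalent generate the manifest from Spark; the table path below is a placeholder:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # Delta-enabled session assumed

# SQL form from the integration guide (hypothetical S3 path).
spark.sql("GENERATE symlink_format_manifest FOR TABLE delta.`s3a://my-bucket/my-delta-table`")

# Equivalent Python API.
DeltaTable.forPath(spark, "s3a://my-bucket/my-delta-table").generate("symlink_format_manifest")
```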
0
votes
1 answer

Z order column in databricks table

I am working on creating a notebook that end users can run by providing the table name as input to get an efficient sample query (by utilising the partition key and Z-order column). I can get the partition column with describe table or…
Nikesh
  • 47
  • 6
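DESCRIBE TABLE does not expose Z-order columns, but OPTIMIZE operations record them in the table history under operationParameters. A hedged PySpark sketch; the table name is a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

table_name = "my_db.my_table"  # hypothetical input from the notebook user

# OPTIMIZE ... ZORDER BY runs record the columns as a JSON string in operationParameters.
history = spark.sql(f"DESCRIBE HISTORY {table_name}")
zorder = (
    history.filter(F.col("operation") == "OPTIMIZE")
           .select(F.col("operationParameters")["zOrderBy"].alias("zOrderBy"))
           .filter(F.col("zOrderBy").isNotNull())
           .first()
)
print(zorder["zOrderBy"] if zorder else "no Z-order information found")
```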
0
votes
1 answer

AttributeError: 'DataFrameWriter' object has no attribute 'schema'

I would like to write a Spark DataFrame with a fixed schema. I am trying this: from pyspark.sql.types import StructType, StructField, IntegerType, DateType, DoubleType my_schema = StructType([ StructField("seg_gs_eur_am", DoubleType()), …
Enrique Benito Casado
  • 1,914
  • 1
  • 20
  • 40
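The schema is attached when a DataFrame is created or read, not on the writer. A hedged sketch; the column name comes from the question, the data and path are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

my_schema = StructType([StructField("seg_gs_eur_am", DoubleType())])

# Option 1: attach the schema when the DataFrame is created (or on spark.read).
df = spark.createDataFrame([(1.5,), (2.0,)], schema=my_schema)

# Option 2: cast an existing DataFrame's columns to match the target schema.
df = df.select(*(F.col(f.name).cast(f.dataType) for f in my_schema.fields))

# The writer has no .schema() method; just write the already-typed DataFrame.
df.write.format("delta").mode("overwrite").save("/tmp/my_table")  # hypothetical path
```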
0
votes
1 answer

case when in merge statement databricks

I am trying to upsert in Databricks using a merge statement in pyspark. I wanted to know if using expressions (e.g. adding two columns, case when) is allowed in the whenMatchedUpdate part. For example, I want to do something like this: deltaTableTarget =…
Abhishek
  • 83
  • 10
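Expressions are allowed: the values in the whenMatchedUpdate set map are SQL expression strings, so arithmetic and CASE WHEN both work. A hedged sketch with hypothetical table and column names:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

source_df = spark.createDataFrame(
    [(1, 10, 5), (2, -3, 7)], ["id", "amount_a", "amount_b"]  # hypothetical source
)

target = DeltaTable.forName(spark, "my_db.target_table")  # hypothetical target table

(target.alias("t")
       .merge(source_df.alias("s"), "t.id = s.id")
       # 'set' values are SQL expression strings, so expressions are fine here.
       .whenMatchedUpdate(set={
           "total": "s.amount_a + s.amount_b",
           "status": "CASE WHEN s.amount_a > 0 THEN 'active' ELSE t.status END",
       })
       .execute())
```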
0
votes
0 answers

Why are the log files of modern data storage formats like Delta and Apache Kafka topics left-padded with zeros up to 20 digits?

I am a data engineer who uses both Apache Kafka for streaming and Delta for storage on lakes. This is more of a question to feed my curiosity. I can see that both the Delta transaction log files (which have a .json extension) as well as the Kafka…
akhil pathirippilly
  • 920
  • 1
  • 7
  • 25
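Not an authoritative answer, but the padding keeps plain string sorting of file names consistent with numeric ordering of versions/offsets, and 20 digits is wide enough for any unsigned 64-bit value. A tiny Python illustration:

```python
# Without padding, lexicographic order breaks numeric order: 10 sorts before 2.
names = [f"{v}.json" for v in (2, 10, 1)]
print(sorted(names))              # ['1.json', '10.json', '2.json']

# Zero-padded names sort in true numeric order.
padded = [f"{v:020d}.json" for v in (2, 10, 1)]
print(sorted(padded))

# 20 digits covers any unsigned 64-bit value.
print(len(str(2**64 - 1)))        # 20
```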
0
votes
0 answers

Create folder wise structure in Delta Format on HDFS

I am consuming Kafka data that has an "eventtime" (datetime) field in the packet. I want to create HDFS directories in a "year/month/day" structure in streaming, based on the date part of the eventtime field. I am using delta-core_2.11:0.6.1, Spark 2.4…
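A hedged sketch of one way to do this: derive year/month/day columns from eventtime and let partitionBy create the folder structure. Brokers, topic, paths, and the JSON payload layout are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # Kafka and Delta packages assumed

# Hypothetical parsing: assume the Kafka value is JSON containing an 'eventtime' field.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
         .option("subscribe", "events")                      # placeholder topic
         .load()
         .selectExpr("CAST(value AS STRING) AS json")
         .withColumn("eventtime",
                     F.get_json_object("json", "$.eventtime").cast("timestamp"))
)

# partitionBy writes year=/month=/day= directories under the Delta table path on HDFS.
query = (
    events.withColumn("year", F.year("eventtime"))
          .withColumn("month", F.month("eventtime"))
          .withColumn("day", F.dayofmonth("eventtime"))
          .writeStream.format("delta")
          .partitionBy("year", "month", "day")
          .option("checkpointLocation", "hdfs:///checkpoints/events")
          .start("hdfs:///data/events")
)
query.awaitTermination()
```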
0
votes
0 answers

How to prevent memory leak (OOM) when running two streams in one spark app?

I have two streaming queries in one app: (1) landing to cleaning zone: move new data in the landing zone (raw data format) to the cleaning zone and save it in delta format; (2) read log data from Kafka, join it with the delta format table (cleaning zone), and save…
user3595632
  • 5,380
  • 10
  • 55
  • 111
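A structural sketch only, not a guaranteed OOM fix: two queries in one app, each with its own checkpoint and throttled input so micro-batches stay bounded. All formats, paths, and limits below are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stream 1: landing zone -> cleaning zone (source format assumed to be delta).
q1 = (
    spark.readStream.format("delta")
         .option("maxFilesPerTrigger", 10)                 # bound batch size
         .load("/landing/events")                          # hypothetical paths
         .writeStream.format("delta")
         .option("checkpointLocation", "/chk/cleaning")    # each query needs its own checkpoint
         .start("/cleaning/events")
)

# Stream 2: Kafka input, also throttled (join with the cleaned table elided here).
q2 = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "logs")
         .option("maxOffsetsPerTrigger", 10000)
         .load()
         .writeStream.format("delta")
         .option("checkpointLocation", "/chk/serving")
         .start("/serving/events")
)

spark.streams.awaitAnyTermination()
```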
0
votes
1 answer

How can I use delta cache function in local spark cluster mode?

I'd like to test delta-cache in local cluster mode (Jupyter). 1. What I want to do: whole delta-formatted files shouldn't be re-downloaded every time; only new data should be re-downloaded. 2. What I've tried ... #…
user3595632
  • 5,380
  • 10
  • 55
  • 111
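Worth noting: the disk-based "delta cache" (spark.databricks.io.cache.enabled) is a Databricks Runtime feature and is not available in plain local Spark. A hedged local stand-in is Spark's own persistence; the path below is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()  # Delta package assumed

df = spark.read.format("delta").load("/tmp/events")   # hypothetical table path
df.cache()        # or df.persist(StorageLevel.MEMORY_AND_DISK) for spill-to-disk behaviour
df.count()        # materialize the cache; subsequent actions reuse the cached blocks
```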