Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and is fully compatible with the Apache Spark APIs. A minimal usage sketch follows the feature list below.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
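
A minimal usage sketch of the API these features describe, in PySpark (the path, column name, and Spark session configuration below are illustrative assumptions, not part of the tag description):

    # Minimal Delta Lake sketch; assumes the delta-spark package is available
    # and that /tmp/delta/events is a writable (hypothetical) location.
    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (
        SparkSession.builder
        .appName("delta-quickstart")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    path = "/tmp/delta/events"

    # ACID write: the same DataFrameWriter API as Parquet, with format("delta").
    spark.range(0, 5).toDF("event_id").write.format("delta").mode("overwrite").save(path)

    # Batch read of the current snapshot ...
    spark.read.format("delta").load(path).show()

    # ... or time travel back to an earlier version for audits and rollbacks.
    spark.read.format("delta").option("versionAsOf", 0).load(path).show()

    # Updates and deletes go through the DeltaTable API; history() is the audit trail.
    tbl = DeltaTable.forPath(spark, path)
    tbl.delete("event_id = 4")
    tbl.history().select("version", "operation").show()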
1226 questions
-1
votes
1 answer

Formatting character in deltalake

columnName (FORMAT'99999999.999') (CHAR(12)) How can I format this in Delta Lake? I searched for a to_char method, but there is no such method. Can anyone help me, please?
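
Older Spark releases have no to_char, so one possible workaround (the amount column below is a hypothetical stand-in for columnName) is printf-style formatting with format_string, which approximates FORMAT '99999999.999' (CHAR(12)):

    # Sketch only: column name and sample value are assumptions from the question.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(12345678.9,)], ["amount"])

    formatted = df.withColumn(
        "amount_char",
        F.format_string("%12.3f", F.col("amount"))  # fixed width 12, 3 decimals
    )
    formatted.show(truncate=False)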
-1
votes
1 answer

PowerBI connection to Spark Delta Lake

I am trying to connect PowerBI Desktop (running on Windows) to a Delta Lake table in a Spark cluster (running on Linux). I've mounted the Delta Lake table folder (with parquet files) to the Windows box via Samba. Now, how should I add a data source…
ozmhsh
  • 61
  • 2
  • 8
-1
votes
1 answer

SQL for UTC Conversion

I have 2 columns in a table as follows: TIMESTAMP, TIMEZONE, with rows like 2020-08-20T02:36:52.000+0000 PST; 2020-08-20T02:36:52.000+0000 GMT; 2020-08-20T02:36:52.000+0000 CST. Now I want to convert those timestamp column… (One possible approach is sketched after this entry.)
Lokesh
  • 87
  • 2
  • 11
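
One possible reading of the question above (TIMESTAMP and TIMEZONE are the columns it names; everything else below is an assumption): if the stored values are UTC, from_utc_timestamp can shift each row into the zone held in its own TIMEZONE column.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    # Sample rows shaped like the question; short IDs such as PST may need to be
    # mapped to region IDs (e.g. America/Los_Angeles) on newer Spark versions.
    df = spark.createDataFrame(
        [("2020-08-20T02:36:52.000+0000", "GMT"),
         ("2020-08-20T02:36:52.000+0000", "America/Chicago")],
        ["TIMESTAMP", "TIMEZONE"],
    )

    out = (
        df.withColumn("TS", F.to_timestamp("TIMESTAMP", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
          .withColumn("LOCAL_TS", F.from_utc_timestamp(F.col("TS"), F.col("TIMEZONE")))
    )
    out.show(truncate=False)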
-1
votes
1 answer

Can't write a DF as a delta table in Spark 2.4.4 and Scala 2.12: java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala

Can't write a DF as a delta table in Spark 2.4.4 and Scala 2.12. Reading a parquet file as a DF and trying to write it as a delta table. Code: val dF=spark.read.load("path") //parquet…
Raptor0009
  • 258
  • 4
  • 14
-2
votes
1 answer

Compare the two versions of dataframes for Upsert and list out changes

I am doing an upsert operation in Databricks. Now I want to check what has changed between two upsert operations. My original df1 looks like this >> My upserted df2 looks like this >> I want output like this >> Here id is my primary_key. (One way to diff the two versions is sketched after this entry.)
Suraj Shejal
  • 640
  • 3
  • 19
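
One way to see what an upsert changed (the path, version numbers, and the id key below are assumptions): time travel to the Delta snapshots before and after the MERGE and diff them with exceptAll.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    path = "/mnt/delta/target"  # hypothetical Delta table path

    # Version numbers are placeholders; DeltaTable history() lists the real ones.
    before = spark.read.format("delta").option("versionAsOf", 1).load(path)
    after = spark.read.format("delta").option("versionAsOf", 2).load(path)

    after.exceptAll(before).show()   # rows inserted or updated by the upsert
    before.exceptAll(after).show()   # prior images of updated (or deleted) rows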
-2
votes
1 answer

How to list all tables by searching a given column name in Spark or Delta Lake

I'm looking for a metadata table which holds all column names, table names, and creation timestamps within Spark SQL and Delta Lake. I need to be able to search by a given column name and list all the tables having that column. (A catalog-based workaround is sketched after this entry.)
Mauryas
  • 41
  • 2
  • 8
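
Open-source Spark has no single metadata table that spans all databases, so one workaround is to walk the catalog API (the column name below is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    wanted = "customer_id"  # placeholder for the column being searched

    matches = []
    for db in spark.catalog.listDatabases():
        for tbl in spark.catalog.listTables(db.name):
            cols = {c.name for c in spark.catalog.listColumns(tbl.name, db.name)}
            if wanted in cols:
                matches.append(f"{db.name}.{tbl.name}")

    print(matches)  # every table containing the column; slow on large catalogs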
-3
votes
1 answer

Is there a CDAP / Data Fusion plugin for transforming to Delta (Delta Lake) Format?

I'd like to use Data Fusion on GCP as my ETL pipeline manager and store the raw data in GCS using the Delta format. Has anyone done this, or does a plugin exist?
-4
votes
1 answer

How to write SQL query to read most recent timestamps (using window function)?

I have a table in my database that has rows inserted without updating old ones. That leads to records with the same ID but different timestamps. How to write a SQL query that uses a window function to read rows with distinct IDs and the most recent…
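
The usual pattern for this (table and column names below are placeholders for the real schema) is a row_number window partitioned by the ID and ordered by the timestamp descending:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Placeholder data standing in for the real table.
    spark.createDataFrame(
        [(1, "2020-01-01 00:00:00"), (1, "2020-02-01 00:00:00"), (2, "2020-01-15 00:00:00")],
        ["id", "ts"],
    ).createOrReplaceTempView("my_table")

    latest = spark.sql("""
        SELECT * FROM (
            SELECT *,
                   row_number() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
            FROM my_table
        ) t
        WHERE rn = 1
    """)
    latest.show()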