Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It also provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and due to the lack of transactions, data engineers have to go through a tedious process to ensure data integrity. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the PySpark merge sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
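
A minimal PySpark sketch tying several of these features together (Delta write, time travel, merge-based updates, and the audit history). The path /tmp/delta/events and the columns are illustrative examples, not part of the quoted description:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is on the classpath

    # Write a batch DataFrame in the open, Parquet-based Delta format.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Time travel: read an earlier snapshot by version (or by timestampAsOf).
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Updates and deletes through the merge API.
    updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "value"])
    target = DeltaTable.forPath(spark, "/tmp/delta/events")
    (target.alias("t")
     .merge(updates.alias("s"), "s.id = t.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # Audit history: the transaction log records every change.
    target.history().show()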
1226 questions
1 vote · 1 answer

pyspark getting distinct values based on groupby column for streaming data

I am trying to get distinct values for a column based on a groupby operation on another column using a PySpark stream, but I am getting an incorrect count. Function created: from pyspark.sql.functions import weekofyear,window,approx_count_distinct def…
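
Not the asker's code, but a hedged sketch of the usual shape of such a query: exact distinct counts are not supported in streaming aggregations, so approx_count_distinct over a watermarked window is the common workaround (the path and the column names event_time, group_col, user_id are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, approx_count_distinct, col

    spark = SparkSession.builder.getOrCreate()

    events = spark.readStream.format("delta").load("/tmp/delta/events")

    # Approximate per-group distinct counts over weekly windows.
    counts = (events
              .withWatermark("event_time", "1 day")
              .groupBy(window(col("event_time"), "7 days"), col("group_col"))
              .agg(approx_count_distinct("user_id").alias("approx_users")))

    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .start())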
1 vote · 2 answers

Spark Delta Table Updates

I am working in the Microsoft Azure Databricks environment using Spark SQL and PySpark. I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day with no…
SWDeveloper
  • 319
  • 1
  • 4
  • 14
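
A hedged sketch of one way to apply updates to a Delta table partitioned by file_date; the paths, the staging source, and the id join key are assumptions, not the asker's setup:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()
    target = DeltaTable.forPath(spark, "/mnt/lake/my_delta_table")
    updates = spark.read.format("delta").load("/mnt/lake/staging_updates")

    # Including the partition column in the merge condition lets Delta prune
    # untouched file_date partitions instead of rewriting the whole table.
    (target.alias("t")
     .merge(updates.alias("s"),
            "t.file_date = s.file_date AND t.id = s.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())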
1 vote · 1 answer

Create SCD2 table from a source file that contains multiple updates for one id using Databricks/Spark

I want to make a slowly changing dimension in Databricks. My source DataFrame contains the following information. +-------------------+-------------------------+----------+-----------+-------------+ | actionimmediately | date |…
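
A simplified sketch (not the asker's code) of one way to derive SCD2 validity ranges when the source holds several updates per id; the path and the column names id and date are assumptions:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, lead

    spark = SparkSession.builder.getOrCreate()
    src = spark.read.format("delta").load("/mnt/lake/source_updates")

    # Order the updates per id; the next change closes the current row.
    w = Window.partitionBy("id").orderBy("date")

    scd2 = (src
            .withColumn("valid_from", col("date"))
            .withColumn("valid_to", lead("date").over(w))
            .withColumn("is_current", col("valid_to").isNull()))

    scd2.write.format("delta").mode("overwrite").save("/mnt/lake/dim_scd2")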
1 vote · 2 answers

Schema mismatch - Spark DataFrame written to Delta

When writing a DataFrame to the Delta format, the resulting Delta table does not seem to follow the schema of the DataFrame that was written. Specifically, the 'nullable' property of a field seems to always be 'true' in the resulting Delta table regardless of the…
NITS
  • 207
  • 4
  • 15
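
A small sketch (with a made-up path and fields) that reproduces the observation: Spark's file-based writes, including Delta, tend to report fields as nullable when the data is read back, even if the DataFrame schema marked them non-nullable:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=True),
    ])
    df = spark.createDataFrame([(1, "a")], schema)
    print(df.schema)  # 'id' reported as nullable = false here

    df.write.format("delta").mode("overwrite").save("/tmp/delta/nullable_check")
    print(spark.read.format("delta").load("/tmp/delta/nullable_check").schema)
    # 'id' typically comes back as nullable = true after the round trip.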
1 vote · 1 answer

Strange non-critical exception when using spark 2.4.3 (emr 5.25.0) with delta lake io 0.6.0

I have been successfully using Spark 2.4.3 - Scala - (in EMR 5.25.0) together with Delta Lake IO 0.6.0. My jobs run fine, but while doing some optimisation and housekeeping I noticed this strange exception, which although it does not appear to…
Carlos Costa
  • 195
  • 1
  • 11
1 vote · 2 answers

PowerBI - support for parquet format from adls gen1

I need to know whether Power BI supports the parquet format as a source from ADLS Gen1. I am planning to use ADLS Gen1 or Databricks Delta Lake (which supports the parquet format only) as a source to get data into Power BI. Kindly suggest or please share any documentation…
Prakash
  • 281
  • 5
  • 18
1 vote · 1 answer

How to ingest different spark dataframes in a single spark job

I want to write an ETL pipeline in Spark handling different input sources, while using as few computing resources as possible, and I have problems using the 'traditional' Spark ETL approach. I have a number of streaming data sources which need to be persisted…
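
One way to sketch this (paths and checkpoint locations are placeholders): run each streaming source as its own query inside a single Spark application and let the driver wait on all of them:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Each source becomes its own streaming query in the same application.
    sources = {
        "orders": "/mnt/bronze/orders",
        "clicks": "/mnt/bronze/clicks",
    }

    for name, path in sources.items():
        (spark.readStream.format("delta").load(path)
            .writeStream
            .format("delta")
            .option("checkpointLocation", f"/mnt/checkpoints/{name}")
            .start(f"/mnt/silver/{name}"))

    # Block until any of the queries stops or fails.
    spark.streams.awaitAnyTermination()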
1 vote · 2 answers

apache superset connecting to databricks delta lake

I am trying to read data from Databricks Delta Lake via Apache Superset. I can connect to Delta Lake with a JDBC connection string supplied by the cluster, but Superset seems to require a SQLAlchemy string, so I'm not sure what I need to do to get…
Henry Dang
  • 25
  • 1
  • 3
1 vote · 1 answer

Is it possible to restore a SQL Server database (or a single table) from backup to Databricks Delta table?

We are planning to migrate some archive SQL Server tables to Databricks Delta tables. Since these are archive tables and might not change frequently, we thought it might be better to restore them from a backup instead of connecting directly through JDBC…
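
For comparison, a sketch of the JDBC route the question mentions, with placeholder hostname, table, and credentials; it requires the SQL Server JDBC driver on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the archive table over JDBC.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=archive")
          .option("dbtable", "dbo.archive_table")
          .option("user", "<user>")
          .option("password", "<password>")
          .load())

    # Persist it as a managed Delta table.
    df.write.format("delta").saveAsTable("archive_delta_table")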
1 vote · 1 answer

Insert data into databricks delta table with past timestamp

I am exploring Databricks Delta tables and their time travel / temporal feature. I have some event data that happened in the past. I am trying to insert it into a Delta table and be able to time travel using the timestamp in the data and not the…
kbahadur
  • 11
  • 2
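
A short sketch of Delta time travel with placeholder paths; note that versionAsOf and timestampAsOf resolve against the commit timestamps recorded in the transaction log, not against event-time columns stored in the rows:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # By version number of the commit.
    v1 = spark.read.format("delta").option("versionAsOf", 1).load("/mnt/delta/events")

    # By commit timestamp (when the write happened, not when the event happened).
    asof = (spark.read.format("delta")
            .option("timestampAsOf", "2019-09-01 00:00:00")
            .load("/mnt/delta/events"))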
1 vote · 1 answer

Custom metadata/tags for Delta Lake?

I'm trying to tie two tables' versions together. For example, if table A's version 1 was used to generate table B's version 3, I want to be able to tell that. Is there something that already exists in Delta Lake that can do this easily? I think…
pikapoo
  • 77
  • 7
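
One possibility, assuming a Delta Lake version that supports user-defined commit metadata (the paths and the tag format below are made up): record table A's version in the commit metadata of each write to table B, then read it back from the history:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    # Latest version of table A at the time of the build.
    a_version = DeltaTable.forPath(spark, "/mnt/delta/table_a").history(1) \
        .select("version").collect()[0][0]

    df_b = spark.read.format("delta").load("/mnt/delta/table_a")  # ... transform ...

    # Attach the lineage tag to the commit that produces table B.
    (df_b.write.format("delta")
     .mode("overwrite")
     .option("userMetadata", f"built_from_table_a_version={a_version}")
     .save("/mnt/delta/table_b"))

    # The tag shows up in the userMetadata column of table B's history.
    DeltaTable.forPath(spark, "/mnt/delta/table_b").history() \
        .select("version", "userMetadata").show(truncate=False)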
1 vote · 0 answers

How to streamout or extract only inserts/adds from a Databricks delta file?

I have a scenario where I want to run a Spark Structured Streaming job to read a Databricks Delta source file and extract only the inserts to the source file. I want to filter out any updates/deletes. I was trying the following on a smaller file but…
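
A hedged sketch with placeholder paths: the Delta streaming source options ignoreDeletes and ignoreChanges let the stream keep running when the source table receives deletes or updates, but ignoreChanges may re-emit updated rows, so it is not a strict inserts-only filter:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stream from a Delta table that also receives updates/deletes.
    inserts = (spark.readStream
               .format("delta")
               .option("ignoreDeletes", "true")
               .option("ignoreChanges", "true")
               .load("/mnt/delta/source_table"))

    query = (inserts.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/inserts_only")
             .start("/mnt/delta/inserts_only"))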
1 vote · 0 answers

Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper.$init$(Lcom/fasterxml/jackson/module/scala/experimental/ScalaObjectMapper;)V While trying to write a df as a delta table in…
Raptor0009
  • 258
  • 4
  • 14
1 vote · 2 answers

How to control the file count in Delta Lake merge output

I'm using Delta Lake 0.4.0 with Merge like: target.alias("t") .merge( src.as("s"), "s.id = t.id" ) .whenMatched().updateAll() .whenNotMatched().insertAll() .execute() src…
processadd
  • 101
  • 3
  • 5
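
The number of files a merge writes roughly tracks the shuffle partition count, so lowering spark.sql.shuffle.partitions before the merge, or compacting afterwards, are common workarounds. A sketch of the compaction route with a placeholder path (the dataChange option requires a later Delta release than 0.4.0):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    path = "/mnt/delta/target_table"  # placeholder

    # Rewrite the table into a fixed number of files; dataChange=false marks
    # the commit as a compaction that adds no new data.
    (spark.read.format("delta").load(path)
     .repartition(16)
     .write
     .format("delta")
     .option("dataChange", "false")
     .mode("overwrite")
     .save(path))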
1 vote · 2 answers

How to connect RStudio in Azure Databricks to Delta Lake

Is there a way to connect RStudio running on an Azure Databricks cluster to Delta Lake / Delta tables? (Read and write mode would be awesome.) In RStudio on the cluster I tried to set up the path to the home directory: - dbfs:/mnt/20_silver/ -…
Iskandel
  • 11
  • 2