Questions tagged [delta-lake]

Delta Lake is an open source project that brings ACID transactions to Apache Spark. It also provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and due to the lack of transactions, data engineers have to go through a tedious process to ensure data integrity. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: Delta Lake transaction log records details about every change made to data providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture (see the PySpark merge sketch after this list).
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
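
A minimal PySpark sketch tying several of these features together (Delta write, time travel, merge-based updates, and the audit history). The path /tmp/delta/events and the columns are illustrative examples, not part of the quoted description:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is on the classpath

    # Write a batch DataFrame in the open, Parquet-based Delta format.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Time travel: read an earlier snapshot by version (or by timestampAsOf).
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Updates and deletes through the merge API.
    updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "value"])
    target = DeltaTable.forPath(spark, "/tmp/delta/events")
    (target.alias("t")
     .merge(updates.alias("s"), "s.id = t.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # Audit history: the transaction log records every change.
    target.history().show()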
1226 questions
1 vote · 1 answer

pyspark getting distinct values based on groupby column for streaming data

I am trying to get distinct values for a column based on a groupby operation on another column using a PySpark stream, but I am getting an incorrect count. Function created: from pyspark.sql.functions import weekofyear,window,approx_count_distinct def…
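
Not the asker's code, but a hedged sketch of the usual shape of such a query: exact distinct counts are not supported in streaming aggregations, so approx_count_distinct over a watermarked window is the common workaround (the path and the column names event_time, group_col, user_id are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, approx_count_distinct, col

    spark = SparkSession.builder.getOrCreate()

    events = spark.readStream.format("delta").load("/tmp/delta/events")

    # Approximate per-group distinct counts over weekly windows.
    counts = (events
              .withWatermark("event_time", "1 day")
              .groupBy(window(col("event_time"), "7 days"), col("group_col"))
              .agg(approx_count_distinct("user_id").alias("approx_users")))

    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .start())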
1 vote · 2 answers

Spark Delta Table Updates

I am working in the Microsoft Azure Databricks environment using Spark SQL and PySpark. I have a Delta table on a lake where data is partitioned by, say, file_date. Every partition contains files storing millions of records per day with no…
SWDeveloper
  • 319
  • 1
  • 4
  • 14
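
A hedged sketch of one way to apply updates to a Delta table partitioned by file_date; the paths, the staging source, and the id join key are assumptions, not the asker's setup:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()
    target = DeltaTable.forPath(spark, "/mnt/lake/my_delta_table")
    updates = spark.read.format("delta").load("/mnt/lake/staging_updates")

    # Including the partition column in the merge condition lets Delta prune
    # untouched file_date partitions instead of rewriting the whole table.
    (target.alias("t")
     .merge(updates.alias("s"),
            "t.file_date = s.file_date AND t.id = s.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())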
1 vote · 1 answer

Create SCD2 table from a source file that contains multiple updates for one id using Databricks/Spark

I want to make a slowly changing dimension in Databricks. My source DataFrame contains the following information. +-------------------+-------------------------+----------+-----------+-------------+ | actionimmediately | date |…
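
A simplified sketch (not the asker's code) of one way to derive SCD2 validity ranges when the source holds several updates per id; the path and the column names id and date are assumptions:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, lead

    spark = SparkSession.builder.getOrCreate()
    src = spark.read.format("delta").load("/mnt/lake/source_updates")

    # Order the updates per id; the next change closes the current row.
    w = Window.partitionBy("id").orderBy("date")

    scd2 = (src
            .withColumn("valid_from", col("date"))
            .withColumn("valid_to", lead("date").over(w))
            .withColumn("is_current", col("valid_to").isNull()))

    scd2.write.format("delta").mode("overwrite").save("/mnt/lake/dim_scd2")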
1 vote · 2 answers

Schema mismatch - Spark DataFrame written to Delta

When writing a DataFrame to the Delta format, the resulting Delta table does not seem to follow the schema of the DataFrame that was written. Specifically, the 'nullable' property of a field seems to always be 'true' in the resulting Delta table regardless of the…
NITS
  • 207
  • 4
  • 15
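
A small sketch (with a made-up path and fields) that reproduces the observation: Spark's file-based writes, including Delta, tend to report fields as nullable when the data is read back, even if the DataFrame schema marked them non-nullable:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    spark = SparkSession.builder.getOrCreate()

    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("name", StringType(), nullable=True),
    ])
    df = spark.createDataFrame([(1, "a")], schema)
    print(df.schema)  # 'id' reported as nullable = false here

    df.write.format("delta").mode("overwrite").save("/tmp/delta/nullable_check")
    print(spark.read.format("delta").load("/tmp/delta/nullable_check").schema)
    # 'id' typically comes back as nullable = true after the round trip.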
1 vote · 1 answer

Strange non-critical exception when using spark 2.4.3 (emr 5.25.0) with delta lake io 0.6.0

I have been successfully using Spark 2.4.3 - Scala - (in EMR 5.25.0) together with Delta Lake IO 0.6.0. My jobs run fine, but while doing some optimisation and housekeeping I noticed this strange exception, which although it does not appear to…
Carlos Costa
  • 195
  • 1
  • 11
1 vote · 2 answers

PowerBI - support for parquet format from adls gen1

I need to know whether Power BI supports the parquet format as a source from ADLS Gen1. I am planning to use ADLS Gen1 or Databricks Delta Lake (which supports the parquet format only) as a source to get data into Power BI. Kindly suggest or please share any documentation…
Prakash
  • 281
  • 5
  • 18
1 vote · 1 answer

How to ingest different spark dataframes in a single spark job

I want to write an ETL pipeline in Spark handling different input sources, while using as few computing resources as possible, and I have problems using the 'traditional' Spark ETL approach. I have a number of streaming data sources which need to be persisted…
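
One way to sketch this (paths and checkpoint locations are placeholders): run each streaming source as its own query inside a single Spark application and let the driver wait on all of them:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Each source becomes its own streaming query in the same application.
    sources = {
        "orders": "/mnt/bronze/orders",
        "clicks": "/mnt/bronze/clicks",
    }

    for name, path in sources.items():
        (spark.readStream.format("delta").load(path)
            .writeStream
            .format("delta")
            .option("checkpointLocation", f"/mnt/checkpoints/{name}")
            .start(f"/mnt/silver/{name}"))

    # Block until any of the queries stops or fails.
    spark.streams.awaitAnyTermination()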
1 vote · 2 answers

apache superset connecting to databricks delta lake

I am trying to read data from Databricks Delta Lake via Apache Superset. I can connect to Delta Lake with a JDBC connection string supplied by the cluster, but Superset seems to require a SQLAlchemy string, so I'm not sure what I need to do to get…
Henry Dang
  • 25
  • 1
  • 3
1 vote · 1 answer

Is it possible to restore a SQL Server database (or a single table) from backup to Databricks Delta table?

We are planning to migrate some archive SQL Server tables to Databricks Delta tables. Since these are archive tables and might not change frequently, we thought it might be better to restore them from a backup instead of connecting directly through JDBC…
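
For comparison, a sketch of the JDBC route the question mentions, with placeholder hostname, table, and credentials; it requires the SQL Server JDBC driver on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the archive table over JDBC.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=archive")
          .option("dbtable", "dbo.archive_table")
          .option("user", "<user>")
          .option("password", "<password>")
          .load())

    # Persist it as a managed Delta table.
    df.write.format("delta").saveAsTable("archive_delta_table")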
1 vote · 1 answer

Insert data into databricks delta table with past timestamp

I am exploring Databricks Delta tables and their time travel / temporal feature. I have some event data that happened in the past. I am trying to insert it into a Delta table and be able to time travel using the timestamp in the data and not the…
kbahadur
  • 11
  • 2
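
A short sketch of Delta time travel with placeholder paths; note that versionAsOf and timestampAsOf resolve against the commit timestamps recorded in the transaction log, not against event-time columns stored in the rows:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # By version number of the commit.
    v1 = spark.read.format("delta").option("versionAsOf", 1).load("/mnt/delta/events")

    # By commit timestamp (when the write happened, not when the event happened).
    asof = (spark.read.format("delta")
            .option("timestampAsOf", "2019-09-01 00:00:00")
            .load("/mnt/delta/events"))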
1 vote · 1 answer

Custom metadata/tags for Delta Lake?

I'm trying to tie two tables' versions together. For example, if table A's version 1 was used to generate table B's version 3, I want to be able to tell that. Is there something that already exists in Delta Lake that can do this easily? I think…
pikapoo
  • 77
  • 7
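
One possibility, assuming a Delta Lake version that supports user-defined commit metadata (the paths and the tag format below are made up): record table A's version in the commit metadata of each write to table B, then read it back from the history:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    # Latest version of table A at the time of the build.
    a_version = DeltaTable.forPath(spark, "/mnt/delta/table_a").history(1) \
        .select("version").collect()[0][0]

    df_b = spark.read.format("delta").load("/mnt/delta/table_a")  # ... transform ...

    # Attach the lineage tag to the commit that produces table B.
    (df_b.write.format("delta")
     .mode("overwrite")
     .option("userMetadata", f"built_from_table_a_version={a_version}")
     .save("/mnt/delta/table_b"))

    # The tag shows up in the userMetadata column of table B's history.
    DeltaTable.forPath(spark, "/mnt/delta/table_b").history() \
        .select("version", "userMetadata").show(truncate=False)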
1 vote · 0 answers

How to streamout or extract only inserts/adds from a Databricks delta file?

I have a scenario where I want to run a Spark Structured Streaming job to read a Databricks Delta source file and extract only the inserts to the source file. I want to filter out any updates/deletes. I was trying the following on a smaller file but…
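
A hedged sketch with placeholder paths: the Delta streaming source options ignoreDeletes and ignoreChanges let the stream keep running when the source table receives deletes or updates, but ignoreChanges may re-emit updated rows, so it is not a strict inserts-only filter:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stream from a Delta table that also receives updates/deletes.
    inserts = (spark.readStream
               .format("delta")
               .option("ignoreDeletes", "true")
               .option("ignoreChanges", "true")
               .load("/mnt/delta/source_table"))

    query = (inserts.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/inserts_only")
             .start("/mnt/delta/inserts_only"))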
1 vote · 0 answers

Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

Exception in thread "main" java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper.$init$(Lcom/fasterxml/jackson/module/scala/experimental/ScalaObjectMapper;)V While trying to write a df as a delta table in…
Raptor0009
  • 258
  • 4
  • 14
1 vote · 2 answers

How to control the file count in Delta Lake merge output

I'm using Delta Lake 0.4.0 with Merge like: target.alias("t") .merge( src.as("s"), "s.id = t.id" ) .whenMatched().updateAll() .whenNotMatched().insertAll() .execute() src…
processadd
  • 101
  • 3
  • 5
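
The number of files a merge writes roughly tracks the shuffle partition count, so lowering spark.sql.shuffle.partitions before the merge, or compacting afterwards, are common workarounds. A sketch of the compaction route with a placeholder path (the dataChange option requires a later Delta release than 0.4.0):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    path = "/mnt/delta/target_table"  # placeholder

    # Rewrite the table into a fixed number of files; dataChange=false marks
    # the commit as a compaction that adds no new data.
    (spark.read.format("delta").load(path)
     .repartition(16)
     .write
     .format("delta")
     .option("dataChange", "false")
     .mode("overwrite")
     .save(path))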
1 vote · 2 answers

How to connect RStudio in Azure Databricks to Delta Lake

Is there a way to connect RStudio running on an Azure Databricks cluster to Delta Lake / Delta tables? (Read and write mode would be awesome.) In RStudio on the cluster I tried to set up the path to the home directory: - dbfs:/mnt/20_silver/ -…
Iskandel
  • 11
  • 2