
I've just found a new Databricks feature called Databricks Data Ingestion. There is very little material about it at this point.

When should I use Databricks Data Ingestion instead of existing mature tools like Azure Data Factory (ADF) or Apache Nifi?

Both ADF and Nifi can ingest into ADLS/S3, and AFAIK ADLS/S3 can be mapped to Databricks DBFS without copying any data, and parquet files can easily be converted into Delta format. So what are the benefits or use cases for the new tool? What am I missing?
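For example, this is roughly the setup I have in mind (the storage account, secret scope and paths are made up; it assumes a Databricks notebook where `spark` and `dbutils` already exist):

```python
# Roughly the setup I mean: point Databricks at data already sitting in ADLS
# and convert the existing parquet files to Delta in place.
spark.conf.set(
    "fs.azure.account.key.mystorageacct.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-key"),
)

# Parquet written by ADF / Nifi, readable in place -- no copy needed
df = spark.read.parquet("abfss://landing@mystorageacct.dfs.core.windows.net/events/")

# One-time, in-place conversion of the parquet directory to Delta
spark.sql(
    "CONVERT TO DELTA parquet.`abfss://landing@mystorageacct.dfs.core.windows.net/events/`"
)
```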

VB_

1 Answer


There are three items in the blog post.

  1. Auto Loader
  2. COPY INTO
  3. Data Ingestion from 3rd party sources

Auto Loader and COPY INTO simplify the state management of your data ingest pipeline. By state management I mean tracking which files or events have been ingested and processed. With Nifi, Airflow, or ADF, you need a separate state store to track which files have or haven't been ingested. ETL systems often 'move' ingested files to another folder; this is still state management. Others track files in a database or a NoSQL data store.

Before Auto Loader or COPY INTO, you would have to:

  1. Detect files in the landing zone
  2. Compare each file with the files already ingested
  3. Present the file for processing
  4. Track which files you've ingested

If these steps get messed up or fall behind, a file might be ingested and processed twice, or lost entirely. Moving files also has a cost in complexity.
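To make that concrete, here is a rough sketch of the hand-rolled bookkeeping those steps imply (the tracking table, table names and paths are hypothetical; assumes a Databricks notebook where `spark` and `dbutils` are available):

```python
# Hypothetical hand-rolled state tracking -- exactly the bookkeeping that
# Auto Loader / COPY INTO take off your hands.
LANDING = "abfss://landing@myaccount.dfs.core.windows.net/events/"
STATE_TABLE = "etl.ingested_files"   # made-up tracking table with a `path` column

# 1. Detect files in the landing zone
landed = {f.path for f in dbutils.fs.ls(LANDING) if f.path.endswith(".parquet")}

# 2. Compare with files already ingested
already = {r.path for r in spark.table(STATE_TABLE).select("path").collect()}
new_files = sorted(landed - already)

if new_files:
    # 3. Present the new files for processing
    (spark.read.parquet(*new_files)
          .write.format("delta").mode("append").saveAsTable("bronze.events"))

    # 4. Track which files have now been ingested
    (spark.createDataFrame([(p,) for p in new_files], "path string")
          .write.mode("append").saveAsTable(STATE_TABLE))
```

If any of this runs twice, falls behind, or races a slow file write, you get duplicates or gaps, which is the failure mode described above.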

With Auto Loader or COPY INTO, you can set up a streaming or incremental data ingest in one statement. Set an archive policy of 7 days or 48 hours on the landing zone and it clears itself automatically. Your code and architecture are greatly simplified.
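For example, an Auto Loader stream is roughly one statement like this (format, schema, paths and checkpoint location are placeholders):

```python
# Auto Loader: incremental / streaming ingest where the "which files have I
# already seen?" state is tracked for you.
(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .schema("id long, ts timestamp, payload string")   # made-up schema
      .load("abfss://landing@myaccount.dfs.core.windows.net/events/")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/mnt/checkpoints/events")
      .start("/mnt/delta/bronze/events"))
```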

Auto Loader (for streams) and COPY INTO (for recurring batch jobs) use Databricks under the covers to track and manage the state for you. For Auto Loader, Databricks will set up the infrastructure (SNS, SQS) that greatly reduces the latency of your streaming data ingest.
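A sketch of the batch side: re-running the same `COPY INTO` only picks up files it hasn't loaded yet (the table and path are placeholders, and the Delta table is assumed to already exist):

```python
# COPY INTO: idempotent, incremental batch ingest -- Databricks tracks which
# files were already loaded, so re-runs are safe.
spark.sql("""
    COPY INTO bronze.events
    FROM 'abfss://landing@myaccount.dfs.core.windows.net/events/'
    FILEFORMAT = PARQUET
""")

# For Auto Loader, the SQS/SNS file-notification mode mentioned above is
# enabled with an option instead of directory polling:
#   .option("cloudFiles.useNotifications", "true")
```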

The third item in the blog post is an announcement of partnerships with well-established data ingest companies that have a wide array of out-of-the-box connectors to your enterprise data. These companies work with Delta Lake.

You would still use Nifi, Airflow, StreamSets etc. to acquire the data from source systems. Those tools would now only trigger the `COPY INTO` command as needed for batch / micro-batch ingest. Auto Loader just runs continuously or when triggered.
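For instance, the "trigger" from an external orchestrator can be as small as a Jobs API call to the Databricks job that runs `COPY INTO` (host, token and job id are placeholders):

```python
# Hypothetical trigger from ADF / Airflow / Nifi: call the Databricks Jobs API
# to run the job or notebook that executes COPY INTO.
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
JOB_ID = 123                                                            # placeholder

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/run-now",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={"job_id": JOB_ID},
)
resp.raise_for_status()
print(resp.json())   # contains the run_id of the triggered ingest run
```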

Douglas M
  • thank you for the answer! I'm a bit confused about ingested-state management. *Could you give an example of when managing ingested state is cumbersome?* With batch you can ingest into `yyyy/mm/dd/hh` partitions, while with streams Kafka/Kinesis do the state management for you. For now, the only major benefit I see for the "new" state management is when you want to do both batch & stream on a single Delta table. Could you name any other benefits over `yyyy/mm/dd/hh` partitioning? – VB_ May 17 '20 at 21:43
  • Sure, you can place files into partitions, but what if you're late in placing the file into the partition? Also, the partition implies you have to wait until the end of the time period associated with the partition before beginning your processing. In an hourly partitioning scheme, your ingestion may be delayed by up to 1 hour. And what if you begin writing your object on the last second of the hour, but it takes 10 seconds, or 100 seconds, or 1000 seconds to write the file? Then you might miss ingesting that object. State management is intended to solve these complexities. – Douglas M May 20 '20 at 01:01
  • thanks for the explanation! I think I got it. The only concern is traceability. Suppose you're using Databricks Ingest and you have partition `2020/05/19/13`. You processed that partition at 2 p.m. At 3 p.m. a new row (call it row `X`) was ingested into that partition, and everything is managed for you. But now confusion may appear: you have some derived data that didn't take row `X` into account. At the same time, row `X` sits inside the ingested data and it's hard to figure out which data the processing was based on. It's natural to either wait for late arrivals or reprocess every time – VB_ May 20 '20 at 01:09
  • Start to think of it as a stream. Every time you run `COPY INTO` you are ingesting data into a Delta Lake table from any new files. The data will appear in your table with the next query. The `input_file_name()` function will tell you the full path of the file related to row `X`. It definitely breaks a bit from traditional batch ETL patterns, but it should simplify the overall job. – Douglas M May 20 '20 at 01:18
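For reference, capturing the source file per row at ingest time (as the last comment suggests) looks roughly like this (paths and table name are placeholders):

```python
# Record the source file for each row at ingest time so a row like `X` can be
# traced back to the file it arrived in.
from pyspark.sql.functions import input_file_name

(spark.read.parquet("abfss://landing@myaccount.dfs.core.windows.net/events/2020/05/19/13/")
      .withColumn("source_file", input_file_name())
      .write.format("delta").mode("append").saveAsTable("bronze.events"))
```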