Questions tagged [databricks-autoloader]

69 questions
1
vote
1 answer

Databricks Auto Loader with Merge Condition

We have the following merge-to-delta function. The merge function ensures we update the record appropriately based on certain conditions. So, in the function usage, you can see we define the merge condition and pass it into the function. This…
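
For context, a minimal sketch of what such a parameterised merge might look like (the helper name, paths, and key column are placeholders, not the asker's actual code):

from delta.tables import DeltaTable

# Hypothetical helper: merge each micro-batch into the target Delta table
# using a merge condition supplied by the caller.
def merge_to_delta(microbatch_df, batch_id, target_path, merge_condition):
    target = DeltaTable.forPath(spark, target_path)
    (target.alias("t")
        .merge(microbatch_df.alias("s"), merge_condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# Usage: define the condition and pass it into the function via foreachBatch.
merge_condition = "t.id = s.id"  # placeholder key column
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/chk/schema")   # placeholder
    .load("/mnt/landing")                                     # placeholder
    .writeStream
    .foreachBatch(lambda df, id: merge_to_delta(df, id, "/mnt/delta/target", merge_condition))
    .option("checkpointLocation", "/mnt/chk/merge")
    .start())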
1
vote
2 answers

How does Databricks Auto Loader identify new files when the cluster is not active?

If my cluster is not active and I have uploaded 50 files to the storage location, where will Auto Loader keep track of these 50 files while the cluster is not active? Will it use a checkpoint location, and if so, how can I set the checkpoint location…
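
For reference, Auto Loader keeps its file-discovery state in the stream's checkpoint, so files that arrive while the cluster is down are picked up on the next run; a rough sketch of where that checkpoint is set (paths and table name are placeholders):

df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/chk/schema")   # placeholder
    .load("/mnt/landing"))                                    # placeholder

(df.writeStream
    .option("checkpointLocation", "/mnt/chk/ingest")  # file-discovery state is kept here
    .trigger(availableNow=True)                       # drain the backlog, then stop
    .toTable("bronze.events"))                        # placeholder table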
1
vote
1 answer

Databricks - Autoloader - Not Terminating?

I'm new to Databricks and I have several Azure Blob .parquet locations I'm pulling data from and want to put through Auto Loader so I can "create table ... using delta location ''" in SQL in another step. (Each parquet file is in its own…
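
If the goal is a one-shot load that terminates on its own, a run-to-completion trigger is the usual pattern; a sketch under placeholder paths:

query = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/mnt/chk/schema")            # placeholder
    .load("wasbs://container@account.blob.core.windows.net/source/")   # placeholder
    .writeStream
    .option("checkpointLocation", "/mnt/chk/source")                   # placeholder
    .trigger(availableNow=True)      # or .trigger(once=True) on older runtimes
    .start("/mnt/delta/source"))     # placeholder Delta location

query.awaitTermination()   # returns once all currently available files are written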
1
vote
1 answer

Azure-Databricks autoloader Binaryfile option with foreach() gives java.lang.OutOfMemoryError: Java heap space

I am trying to copy files from one location to another using the binaryFile option and foreach(copy) in Auto Loader. It runs well with smaller files (up to 150 MB) but fails with bigger files and throws the exception below: *22/09/07 10:25:51 INFO…
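
One workaround to consider for the memory pressure is to avoid materialising the content column at all and copy by path instead; a rough sketch, assuming dbutils is available and the path mapping is a placeholder:

def copy_batch(batch_df, batch_id):
    # Copy by path on the driver rather than writing the in-memory content column.
    for row in batch_df.select("path").collect():
        src = row["path"]
        dst = src.replace("/landing/", "/archive/")   # placeholder path mapping
        dbutils.fs.cp(src, dst)

(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .load("/mnt/landing")            # placeholder
    .select("path")                  # prune the content column early
    .writeStream
    .foreachBatch(copy_batch)
    .option("checkpointLocation", "/mnt/chk/copy")   # placeholder
    .start())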
1
vote
1 answer

Auto Loader with Merge Into for multiple tables

I am trying to implement Auto Loader with MERGE INTO on multiple tables using the code below, as described in the documentation: def upsert_data(df, epoch_id): deltaTable = DeltaTable.forPath(spark, target_location)\ …
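
A rough sketch of parameterising that upsert across tables (the table list, paths, and key columns are placeholders, not the asker's setup):

from functools import partial
from delta.tables import DeltaTable

def upsert_data(df, epoch_id, target_location, key):
    deltaTable = DeltaTable.forPath(spark, target_location)
    (deltaTable.alias("t")
        .merge(df.alias("s"), f"t.{key} = s.{key}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# One stream per target table, each with its own checkpoint and schema location.
tables = [("/mnt/landing/orders", "/mnt/delta/orders", "order_id"),
          ("/mnt/landing/items",  "/mnt/delta/items",  "item_id")]

for source, target, key in tables:
    (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"{target}/_schema")
        .load(source)
        .writeStream
        .foreachBatch(partial(upsert_data, target_location=target, key=key))
        .option("checkpointLocation", f"{target}/_checkpoint")
        .start())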
1
vote
1 answer

Azure Databricks Auto Loader Spark streaming unable to read input file

I have set up a streaming job using the Auto Loader feature, and the input is located in Azure ADLS Gen2 in parquet format. Below is the code. df = spark.readStream.format("cloudFiles")\ .options(**cloudfile)\ …
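
For illustration, a typical shape for such an options dictionary against ADLS Gen2 (all values here are placeholders):

cloudfile = {
    "cloudFiles.format": "parquet",
    "cloudFiles.schemaLocation": "abfss://container@account.dfs.core.windows.net/chk/schema",
    "cloudFiles.useNotifications": "false",   # directory-listing mode
}

df = (spark.readStream.format("cloudFiles")
    .options(**cloudfile)
    .load("abfss://container@account.dfs.core.windows.net/landing/"))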
1
vote
2 answers

Databricks Autoloader throws IllegalArgumentException

I'm trying the simplest Auto Loader example included on the Databricks website https://databricks.com/notebooks/Databricks-Data-Integration-Demo.html df = (spark.readStream.format("cloudFiles") .option("cloudFiles.format", "json") …
1
vote
0 answers

Implementing CDC with Auto Loader in Databricks with Python not returning newly inserted rows

I am not able to capture the newly inserted rows in the dataframe. I have researched it and found nothing done in Python, only in SQL. #implementing autoloader autoloader_df1 = (spark.readStream.format("cloudFiles") …
1
vote
2 answers

Databricks Auto Loader file processing issue

I have zip files in my container, and I get one or more files every day; as they come in, I want to process them. I have some questions. Can I use the Databricks Auto Loader feature to process zip files? Are zip files supported by…
1
vote
1 answer

Streaming job finishes before writing incremental data

I'm having a problem with a streaming job that uses trigger.once. When I run it for the first time, it works fine: it writes all available data on the path and finishes. But on the next day, when there is new data available in the source path, the stream doesn't…
1
vote
2 answers

How to add traceability columns with Auto Loader - ADF integration?

I am using Azure Data Factory to copy source data into a landing zone (ADLS Gen2) and then using Auto Loader to load it into bronze Delta tables. Everything works perfectly, except I am not able to derive pipeline_name, runid and trigger_time as derived…
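
One common way to carry those values through is to pass them from ADF to the notebook as parameters and stamp them onto the stream as literal columns; a sketch assuming hypothetical widget names:

from pyspark.sql.functions import lit

# Hypothetical parameters passed by the ADF pipeline that triggers the job.
pipeline_name = dbutils.widgets.get("pipeline_name")
run_id = dbutils.widgets.get("run_id")
trigger_time = dbutils.widgets.get("trigger_time")

df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/mnt/chk/schema")   # placeholder
    .load("/mnt/landing")                                     # placeholder
    .withColumn("pipeline_name", lit(pipeline_name))
    .withColumn("run_id", lit(run_id))
    .withColumn("trigger_time", lit(trigger_time)))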
1
vote
1 answer

How to filter files in Databricks Autoloader stream

I want to set up an S3 stream using Databricks Auto Loader. I have managed to set up the stream, but my S3 bucket contains different types of JSON files. I want to filter them out, preferably in the stream itself rather than using a filter…
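
For reference, the file-source pathGlobFilter option restricts which objects the stream picks up; a sketch with placeholder bucket paths and pattern:

df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/chk/schema")   # placeholder
    .option("pathGlobFilter", "*_events.json")                          # placeholder pattern
    .load("s3://my-bucket/landing/"))                                   # placeholder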
1
vote
1 answer

Ingest CSV data with Auto Loader with a specific delimiter / separator

I'm trying to load several CSV files with a complex separator ("~|~"). The code currently loads the CSV files but does not identify the correct columns, because it is using the separator (","). I'm reading the documentation here…
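
In case it helps frame the question, format-specific CSV reader options such as sep can be passed alongside the cloudFiles options; a minimal sketch with placeholder paths:

df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/chk/schema")   # placeholder
    .option("sep", "~|~")      # multi-character separator
    .option("header", "true")
    .load("/mnt/landing/csv/"))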
0
votes
0 answers

Databricks Autoloader throws conversion error instead of writing the value to the rescue column

I'm reading parquet files and trying to load them into the target Delta table, using the following code: (spark.readStream .format("cloudFiles") .option("cloudFiles.format", "parquet") .option("cloudFiles.schemaLocation", checkpoint_path) …
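
For context, the rescued data column can be named explicitly and schema evolution set to rescue mode; a sketch reusing the checkpoint_path name from the question, with placeholder paths, and no claim that this resolves the conversion error described:

checkpoint_path = "/mnt/chk/schema"   # placeholder, named as in the question

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("cloudFiles.schemaEvolutionMode", "rescue")   # route non-conforming values to the rescue column
    .option("rescuedDataColumn", "_rescued_data")
    .load("/mnt/landing/parquet/"))                       # placeholder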
0
votes
0 answers

Stopping the Databricks Auto Loader once foreachBatch completes for a particular run

We are running a Databricks Auto Loader streaming job 24/7. The approach we are trying to follow is to stop the job on weekends and run the VACUUM and OPTIMIZE commands, but we are not sure how we can stop the job based on the foreachBatch…
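
One pattern worth noting: instead of a 24/7 stream, schedule the job and let each run drain the backlog with an availableNow trigger, then run maintenance once the query returns; a sketch with placeholder paths and a stub batch handler:

def process_batch(df, batch_id):
    # Stand-in for the existing foreachBatch handler.
    df.write.format("delta").mode("append").save("/mnt/delta/bronze")   # placeholder sink

query = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/chk/schema")   # placeholder
    .load("/mnt/landing")                                     # placeholder
    .writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/mnt/chk/stream")
    .trigger(availableNow=True)
    .start())

query.awaitTermination()                         # returns once the backlog is processed
spark.sql("OPTIMIZE delta.`/mnt/delta/bronze`")  # maintenance can run after the stream stops
spark.sql("VACUUM delta.`/mnt/delta/bronze`")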