Questions tagged [databricks-autoloader]

69 questions
0
votes
0 answers

Modify read data before writing in Databricks Autoloader

I'm implementing a streaming read from one dataset and a write to another dataset using Databricks Autoloader. How can I apply some custom modification code to the data after reading it and before writing? E.g. something like this: def my_modification(df): …
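One common pattern is to treat the streaming DataFrame like any other DataFrame and apply the transformation between readStream and writeStream. A minimal sketch, assuming an Autoloader (cloudFiles) source; the paths, table name, and the body of my_modification are hypothetical placeholders:

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def my_modification(df: DataFrame) -> DataFrame:
    # Hypothetical transformation: stamp each row and drop rows without an id.
    return (df
            .withColumn("ingested_at", F.current_timestamp())
            .filter(F.col("id").isNotNull()))

raw_df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/tmp/schema")  # hypothetical path
          .load("/mnt/source"))                                # hypothetical path

(my_modification(raw_df)
 .writeStream
 .option("checkpointLocation", "/tmp/checkpoint")  # hypothetical path
 .toTable("target_table"))                         # hypothetical table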
0
votes
0 answers

Databricks Autoloader SchemaHints

I am using Autoloader to load CSV data from an S3 bucket, and I am executing the Autoloader query using a DLT pipeline. My DLT pipeline works fine and creates the table based on the query, but when it creates the table all the fields seem to be of type 'string'…
Ananya
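When CSV columns all come back as string, two cloudFiles options are relevant: cloudFiles.inferColumnTypes (CSV columns are inferred as string unless this is set to true) and cloudFiles.schemaHints, which pins named columns to explicit types. A minimal sketch in DLT; the bucket path and the hinted column names and types are hypothetical:

import dlt

@dlt.table
def my_table():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("header", "true")
            .option("cloudFiles.inferColumnTypes", "true")
            # schemaHints overrides inference for the named columns only;
            # the names and types below are hypothetical examples.
            .option("cloudFiles.schemaHints", "age INT, salary DOUBLE")
            .load("s3://my-bucket/csv-data/"))  # hypothetical path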
0
votes
0 answers

Databricks: autoloader and multiple files with differing schema?

I'm following the Databricks Cloud tutorial. I see sample data located at…
notaorb
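For files with differing schemas, Autoloader's schema-evolution setting decides how new columns are handled. A minimal sketch, assuming hypothetical paths:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      # "addNewColumns" (the default) evolves the stored schema when a new
      # column appears; "rescue" instead routes unexpected fields into the
      # _rescued_data column without changing the schema.
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
      .option("cloudFiles.schemaLocation", "/tmp/schema")  # hypothetical path
      .load("/mnt/landing"))                               # hypothetical path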
0
votes
1 answer

Autoloader filter duplicates

I have a streaming DataFrame and wonder how I can eliminate duplicates and keep only the latest modifiedon row per id. For example: id modifiedon 1 03/08/2023 1 03/08/2023 2 02/08/2023 2 03/08/2023 Desired…
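Keeping the latest row per key needs an ordered window, which a plain streaming query does not allow; a common workaround is to deduplicate per micro-batch with foreachBatch. A minimal sketch, assuming hypothetical sink table and checkpoint path:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def keep_latest(batch_df, batch_id):
    # Within each micro-batch, keep only the most recent row per id.
    w = Window.partitionBy("id").orderBy(F.col("modifiedon").desc())
    latest = (batch_df
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))
    latest.write.mode("append").saveAsTable("target_table")  # hypothetical sink

(streaming_df.writeStream
 .foreachBatch(keep_latest)
 .option("checkpointLocation", "/tmp/checkpoint")  # hypothetical path
 .start())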
0
votes
1 answer

Databricks Autoloader multiple folders

I have a hard time understanding how Autoloader will work with multiple folders in ADLS Gen2 and how I should pass the data_source path. I have the following folder structure, where data for multiple tables lands every 15 minutes in my storage…
Greencolor
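Autoloader accepts Hadoop-style glob patterns in the load path, so a single stream can cover several sibling folders; the alternative is one stream (with its own checkpoint) per table folder. A minimal sketch, assuming a hypothetical storage account, container, and folder layout:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "/tmp/schema")  # hypothetical path
      # {table_a,table_b} expands to both folders; the layout is hypothetical.
      .load("abfss://landing@myaccount.dfs.core.windows.net/{table_a,table_b}/*"))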
0
votes
0 answers

Reuse auto loader in different storages

I have two storage accounts on Azure: an old storage and a new storage. Some data in the old storage is ingested by Auto Loader and works well. But now I'm moving the data from the old storage to the new storage, including the Auto Loader with checkpoints, etc., but…
0
votes
1 answer

Generating "load_date" column in Azure Data Lake from RAW to Bronze with Autoloader for batch ingestion

I am ingesting data from the RAW layer (ADLS Gen2) to the Bronze layer with Databricks using Autoloader. These are not real-time data but batch data, and every day we get new files in the raw path, which arrive via ADF. Now for one of the datasets I am doing a…
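One straightforward way to stamp each load is to add the column between the read and the write, combined with the availableNow trigger for batch-style runs. A minimal sketch, assuming hypothetical paths and table name:

from pyspark.sql import functions as F

(spark.readStream.format("cloudFiles")
 .option("cloudFiles.format", "parquet")
 .option("cloudFiles.schemaLocation", "/tmp/schema")         # hypothetical path
 .load("abfss://raw@account.dfs.core.windows.net/dataset/")  # hypothetical path
 .withColumn("load_date", F.current_date())
 .writeStream
 .option("checkpointLocation", "/tmp/checkpoint")            # hypothetical path
 .trigger(availableNow=True)  # process what has arrived, then stop
 .toTable("bronze.dataset"))                                 # hypothetical table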
0
votes
0 answers

WriteStream stopping when RDD is empty

I have an autoloader stream: streaming_df = ( spark.readStream.format("cloudFiles") .option("cloudFiles.schemaLocation", checkpoint_path) .option("cloudFiles.format", "avro") .load(source_path) ) json_string_df =…
Duccio Borchi
0
votes
1 answer

schema mismatch error in databricks while reading file from storage account

I have the below script, which I run in my Unity Catalog-enabled Databricks workspace, and I get the below error. The schema and code worked for my other tenant in a different workspace, and I was hoping it would be the same for this tenant. Now I don't have time to…
0
votes
0 answers

FileDiscovery in Autoloader Databricks for streaming job, Glob Patterns not working

I have a Databricks streaming job which uses Autoloader for file discovery, but the problem is that it is unable to list the files according to the glob pattern I have provided. Right now the raw zone of our files contains data from 24th March 2023 till today…
Arpan Sarkar
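Two places accept patterns: globs in the load path restrict which directories and files are listed, while the pathGlobFilter option matches against the final file name only. A minimal sketch, assuming a hypothetical year/month/day bucket layout:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/tmp/schema")  # hypothetical path
      .option("pathGlobFilter", "*.json")  # filters on file names only
      # Globs in the path narrow the listing; this layout is hypothetical.
      .load("s3://bucket/raw/2023/*/*/"))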
0
votes
0 answers

Databricks Autoloader not saving data

I am very new to Databricks Autoloader. I am trying to ingest a simple CSV file with 3 records in the format [Fname, Lname, age]. The following code runs successfully in Databricks, but no data is getting saved. I'm sure I am missing something…
marie20
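A frequent cause is a stream that is defined but never started or awaited, or a missing checkpoint location. A minimal end-to-end sketch that writes and then waits for the write to finish, assuming hypothetical paths and table name:

query = (spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "csv")
         .option("header", "true")
         .option("cloudFiles.schemaLocation", "/tmp/schema")  # hypothetical path
         .load("/mnt/landing/people/")                        # hypothetical path
         .writeStream
         .option("checkpointLocation", "/tmp/checkpoint")     # hypothetical path
         .trigger(availableNow=True)  # process available files, then stop
         .toTable("people_bronze"))   # hypothetical table
query.awaitTermination()  # block until the availableNow run completes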
0
votes
0 answers

Autoloader checkpoint preservation

Is it possible to restore the contents of the checkpoint location after altering a non-empty table? I am using Databricks Autoloader to load a table. I need to update the data type of one of the columns. But I believe this won't be…
marie20
0
votes
0 answers

How does Databricks Autoloader split data in microbatches?

Based on this, Databricks Runtime >= 10.2 supports the "availableNow" trigger, which can be used to perform batch processing in smaller distinct microbatches, whose size can be configured either via the total number of files (maxFilesPerTrigger)…
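With availableNow, Autoloader drains the backlog as a sequence of micro-batches whose size is capped by the cloudFiles rate-limit options. A minimal sketch; the limits, paths, and table name are hypothetical examples:

(spark.readStream.format("cloudFiles")
 .option("cloudFiles.format", "json")
 .option("cloudFiles.schemaLocation", "/tmp/schema")  # hypothetical path
 .option("cloudFiles.maxFilesPerTrigger", 1000)       # cap each batch by file count
 .option("cloudFiles.maxBytesPerTrigger", "10g")      # cap each batch by total size
 .load("/mnt/landing")                                # hypothetical path
 .writeStream
 .option("checkpointLocation", "/tmp/checkpoint")     # hypothetical path
 .trigger(availableNow=True)
 .toTable("bronze_table"))                            # hypothetical table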
0
votes
1 answer

Read JSON with a base64 column value in Databricks with Autoloader and inferSchema

I have JSON files landing in our blob storage with two fields: offset (integer) and value (base64). The value column is JSON with unicode (and that's why it's base64-encoded). { "offset": 1, "value": "eyJfaWQiOiAiNjQxY2I3MWQyY...a very long base64-encoded…
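Schema inference only sees the two outer fields, so the inner payload has to be decoded and parsed explicitly. A minimal sketch; the inner schema, paths, and field names are hypothetical:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema for the decoded inner JSON document.
inner_schema = StructType([StructField("_id", StringType())])

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/tmp/schema")  # hypothetical path
      .load("/mnt/blob/events")                            # hypothetical path
      # unbase64 yields binary; cast to string, then parse as JSON.
      .withColumn("decoded", F.unbase64(F.col("value")).cast("string"))
      .withColumn("parsed", F.from_json(F.col("decoded"), inner_schema)))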
0
votes
0 answers

XML streaming using Autoloader in Azure databricks

I am trying to use readStream with the binary format for XML in Azure Databricks. rootTag = "Message" inputPath = '/mnt/xyz//1.0/20220401/*.xml' df = spark.read.format('com.databricks.spark.xml').option("rowtag",…
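On recent runtimes (Databricks Runtime 14.3 LTS and later, where native XML support is available) Autoloader can read XML directly, which avoids the binary-file workaround; whether your runtime supports this is the assumption here. A minimal sketch reusing the rowTag and path from the question:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "xml")  # requires native XML support (DBR 14.3+)
      .option("rowTag", "Message")
      .option("cloudFiles.schemaLocation", "/tmp/schema")  # hypothetical path
      .load("/mnt/xyz//1.0/20220401/"))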