Questions tagged [azure-databricks]

For questions about the usage of the Databricks Lakehouse Platform on Microsoft Azure

Overview

Azure Databricks is the Azure-based implementation of Databricks, a managed platform for working with Apache Spark that includes Jupyter-style notebooks.

Azure Databricks is a first-class Azure service and natively integrates with other Azure services such as Active Directory, Blob Storage, Cosmos DB, Data Lake Store, Event Hubs, HDInsight, Key Vault, Synapse Analytics, etc.


4095 questions
1 vote, 1 answer

How to access different storage accounts with same container name in databricks notebooks

I have two different storage accounts with the same container name. Let's say tenant1 and tenant2 are the storage account names, with "appdata" as the container name in both accounts. I can create and mount both containers to DBFS, but I am unable to read/write…
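
A possible direction (a minimal sketch, assuming service-principal OAuth; the secret scope/key names and tenant ID placeholder are hypothetical): mount each account's container under a distinct mount point, so the identical container names never collide.

```python
# Mount the "appdata" container of both accounts at distinct DBFS paths.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

for account, mount_point in [("tenant1", "/mnt/tenant1/appdata"),
                             ("tenant2", "/mnt/tenant2/appdata")]:
    dbutils.fs.mount(
        source=f"abfss://appdata@{account}.dfs.core.windows.net/",
        mount_point=mount_point,
        extra_configs=configs,
    )
```

Reads and writes then go through the account-specific paths, e.g. spark.read.parquet("/mnt/tenant1/appdata/...").
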
1 vote, 1 answer

Databricks rename a folder

I am reading data from a folder /mnt/lake/customer, where /mnt/lake is the mount path referring to ADLS Gen 2. Now I would like to rename the folder from /mnt/lake/customer to /mnt/lake/customeraddress without copying the data from one folder to…
Kumar
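
A sketch of the likely answer: on ADLS Gen2 with a hierarchical namespace, a directory move is a server-side rename, so dbutils.fs.mv should not re-copy the data.

```python
# Rename the directory in place; recurse=True is required for directories.
dbutils.fs.mv("/mnt/lake/customer", "/mnt/lake/customeraddress", recurse=True)
```
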
1 vote, 1 answer

Get last modified date of Folders and Files in Azure Databricks

I need to get the last modified dates of all folders and files in a DBFS mount point (of ADLS Gen1) under Azure Databricks. The folder structure is like: not containing any files, empty…
Gopesh
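
A sketch using the JVM Hadoop FileSystem API, which exposes modification times for both files and folders (recent Databricks runtimes also return a modificationTime field from dbutils.fs.ls). The mount path below is hypothetical.

```python
from datetime import datetime

def list_with_mtime(path):
    """Yield (path, is_directory, last_modified) for each entry under `path`."""
    hadoop_path = spark._jvm.org.apache.hadoop.fs.Path(path)
    fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())
    for status in fs.listStatus(hadoop_path):
        yield (status.getPath().toString(),
               status.isDirectory(),
               datetime.fromtimestamp(status.getModificationTime() / 1000))

for name, is_dir, mtime in list_with_mtime("/mnt/adls"):
    print(name, "dir" if is_dir else "file", mtime)
```
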
1 vote, 2 answers

Pyspark DataFrame - Escaping &

I have some large (~150 GB) CSV files using semicolon as the separator character. I have found that some of the fields contain an HTML-encoded ampersand (&amp;). The semicolon is getting picked up as a column separator, so I need a way to escape it or…
Connell.O'Donnell
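
One way the answers could go (a sketch; the input path is hypothetical): decode the HTML entity before the parser ever sees its semicolon, by reading raw lines first and handing the cleaned lines to the CSV reader.

```python
from pyspark.sql import functions as F

# Read raw lines, replace the entity, then parse with ';' as the separator.
raw = spark.read.text("/mnt/lake/raw/big_file.csv")
decoded = raw.withColumn("value", F.regexp_replace("value", "&amp;", "&"))

parsed = (spark.read
          .option("sep", ";")
          .option("header", "true")
          .csv(decoded.rdd.map(lambda r: r.value)))
```
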
1 vote, 0 answers

Design stream pipeline using spark structured streaming and databricks delta to handle multiple tables

I am designing a streaming pipeline where I need to consume events from a Kafka topic. A single Kafka topic can carry data from around 1,000 tables, with the data arriving as JSON records. Now I have the following problems to solve: reroute messages based on their table…
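
A sketch of one common design (the topic, broker, and paths are hypothetical): read the topic once, then fan each micro-batch out per table inside foreachBatch.

```python
from pyspark.sql import functions as F

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "ingest-topic")
          .load()
          .select(F.col("value").cast("string").alias("json")))

def route(batch_df, batch_id):
    # Extract the table name from each JSON record and append to per-table Delta paths.
    tagged = batch_df.withColumn("table", F.get_json_object("json", "$.table"))
    for (t,) in tagged.select("table").distinct().collect():
        (tagged.filter(F.col("table") == t).drop("table")
               .write.format("delta").mode("append").save(f"/mnt/delta/{t}"))

(stream.writeStream
       .option("checkpointLocation", "/mnt/checkpoints/router")
       .foreachBatch(route)
       .start())
```
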
1 vote, 2 answers

Azure Databricks - Generate SQL Select Statement with Columns

I have tables in Azure Databricks that I interact with using SQL via a notebook. I need to select all columns from a table with 200 columns, but I need to modify some of them for a select-insert (to modify specific columns…
GLowe
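
A sketch of the usual trick: build the column list in Python from the table's schema and override only the columns that need changing (the table and override names are hypothetical).

```python
# Columns that need a modified expression in the SELECT.
overrides = {"price": "CAST(price AS DECIMAL(18,2)) AS price"}

cols = spark.table("mydb.wide_table").columns          # all 200 column names
select_list = ",\n  ".join(overrides.get(c, c) for c in cols)
stmt = f"SELECT\n  {select_list}\nFROM mydb.wide_table"
print(stmt)            # paste into a SQL cell, or run via spark.sql(stmt)
```
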
1 vote, 1 answer

Need to decompress an archive from Azure Databricks, using Python

I'm using code to decompress an archive coming from blob storage, and this code already works for another archive of 300 MB, but while trying to decompress one bigger than that, I got this error: "NotImplementedError: That…
JowOfBeco
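
That NotImplementedError is typically Python's zipfile refusing a compression method it does not implement (Deflate64 is a common culprit for large zips). A hedged workaround, assuming that diagnosis: copy the blob to local disk and extract with a drop-in replacement such as zipfile-deflate64 (%pip install zipfile-deflate64 first; all paths are hypothetical).

```python
import shutil
import zipfile_deflate64 as zipfile  # patches Deflate64 support into zipfile

# Work on local disk; the FUSE /dbfs path copies the result back out.
dbutils.fs.cp("/mnt/blob/archive.zip", "file:/tmp/archive.zip")
with zipfile.ZipFile("/tmp/archive.zip") as zf:
    zf.extractall("/tmp/extracted")
shutil.copytree("/tmp/extracted", "/dbfs/mnt/blob/extracted", dirs_exist_ok=True)
```
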
1 vote, 1 answer

How to export files generated to Azure DevOps from Azure Databricks after a job terminates?

We are using Azure DevOps to submit a training job to Databricks. The training job uses a notebook to train a machine learning model. We are using the Databricks CLI to submit the job from ADO. In the notebook, in one of the steps, we create a .pkl file, we…
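
A sketch of the notebook side (paths are hypothetical): persist the .pkl to DBFS so it outlives the job cluster; the ADO pipeline can then pull it with `databricks fs cp dbfs:/FileStore/artifacts/model.pkl $(Build.ArtifactStagingDirectory)` and publish it with a PublishBuildArtifacts task.

```python
import os
import pickle

model = {"weights": [0.1, 0.2]}  # stand-in for the trained model object

os.makedirs("/dbfs/FileStore/artifacts", exist_ok=True)
with open("/dbfs/FileStore/artifacts/model.pkl", "wb") as f:
    pickle.dump(model, f)
```
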
1 vote, 1 answer

spark data frame Schema With Data Definitions

I'm trying to add comments to the fields (schema with data definitions); below is the implementation I'm trying. I tried StructType.add() (code in comments) and also StructType([StructField("field", dtype, nullable, metadata)]), and got the below…
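
A sketch of the metadata route: StructField takes a metadata dict as its fourth argument, and Spark picks up a key named "comment" as the column comment when the DataFrame is saved as a table (the table name here is hypothetical).

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("customer_id", IntegerType(), False,
                metadata={"comment": "Surrogate key from the source system"}),
    StructField("customer_name", StringType(), True,
                metadata={"comment": "Full legal name"}),
])

df = spark.createDataFrame([(1, "Acme Ltd")], schema)
df.write.saveAsTable("customer_dim")
spark.sql("DESCRIBE customer_dim").show(truncate=False)  # comments appear here
```
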
1 vote, 1 answer

DataBricks cannot show data from Data Lake gen 2

We're migrating from blob storage to ADLS Gen 2 and want to test access to the Data Lake from Databricks. I created a service principal which has Blob Storage Reader and Blob Storage Contributor access to the Data Lake. My notebook sets the below…
Morez
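
A sketch of direct abfss access with a service principal; note that ADLS Gen2 data-plane reads usually require the Storage Blob Data Reader/Contributor RBAC roles rather than the management-plane reader roles. The account, secret scope, and path names are hypothetical.

```python
account = "mydatalake"
suffix = f"{account}.dfs.core.windows.net"

spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}",
               dbutils.secrets.get("scope", "sp-client-id"))
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}",
               dbutils.secrets.get("scope", "sp-client-secret"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

df = spark.read.parquet(f"abfss://appdata@{suffix}/some/path")
```
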
1 vote, 0 answers

Databricks: dbutils.fs.mv() throws java.io.FileNotFoundException although file exists

I have a function to replace parquet files in ADLS Gen2: def replace_parquet_file(df: DataFrame, path: str): path_new = path + '_new' path_old = path + '_old' if not file_exists(path): df.write.mode('overwrite').parquet(path) else: …
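
A sketch of the swap pattern with the usual pitfall fixed: dbutils.fs.mv on a directory needs recurse=True, and the staged write should go to the _new path rather than the final one.

```python
def replace_parquet_file(df, path):
    path_new, path_old = path + "_new", path + "_old"
    df.write.mode("overwrite").parquet(path_new)   # stage the new data first
    dbutils.fs.mv(path, path_old, recurse=True)    # park the old directory
    dbutils.fs.mv(path_new, path, recurse=True)    # promote the staged data
    dbutils.fs.rm(path_old, recurse=True)          # clean up
```
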
1 vote, 0 answers

How to correctly tune the Spark cluster executor memory garbage collection?

I have an Azure Databricks Spark cluster consisting of 6 nodes (5 workers + 1 driver) with 16 cores and 64 GB of memory each. I'm running a PySpark notebook that: reads a DataFrame from Parquet files, caches it (df.cache()), and executes an action on it…
lqrz
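
These knobs live in the cluster's Spark config (they cannot be changed from a running notebook). A sketch, shown as the spark_conf section of a Clusters API payload; the values are illustrative starting points, not recommendations.

```python
cluster_spark_conf = {
    "spark.executor.memory": "40g",           # leave headroom below the 64 GB node
    "spark.memory.fraction": "0.6",           # heap share for execution + storage
    "spark.memory.storageFraction": "0.5",    # cache share within that pool
    "spark.executor.extraJavaOptions":
        "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35",
}
```
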
1 vote, 2 answers

How to export an MLflow Model from Azure Databricks as an Azure DevOps Artifact for the CD phase?

I am trying to create an MLOps pipeline using Azure DevOps and Azure Databricks. From Azure DevOps, I am submitting a Databricks job to a cluster, which trains a machine learning model and saves it into the MLflow Model Registry with a custom flavour…
Anirban Saha
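
A sketch of the CD side (the model name and stage are hypothetical): run this on the ADO agent against the Databricks-hosted tracking server, then publish the downloaded folder as a pipeline artifact.

```python
import mlflow
from mlflow.artifacts import download_artifacts

# Requires DATABRICKS_HOST and DATABRICKS_TOKEN in the agent's environment.
mlflow.set_tracking_uri("databricks")

local_dir = download_artifacts(
    artifact_uri="models:/my_custom_model/Production",
    dst_path="model_artifact",
)
print(f"Downloaded to {local_dir}; publish with a PublishBuildArtifacts task")
```
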
1 vote, 1 answer

Move Files from Azure Files to ADLS Gen 2 and Back using Databricks

I have a Databricks process which currently generates a bunch of text files that get stored in Azure Files. These files need to be moved to ADLS Gen 2 on a scheduled basis, and back to the file share. How can this be achieved using Databricks?
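
Azure Files is not mountable from Databricks the way Blob/ADLS is, so one sketch (share and secret names are hypothetical; %pip install azure-storage-file-share first) pulls files through the SDK and lands them on the ADLS Gen2 mount; the reverse direction uses upload_file the same way.

```python
from azure.storage.fileshare import ShareClient

share = ShareClient.from_connection_string(
    dbutils.secrets.get("scope", "files-conn-str"), share_name="exports")

# Copy every file in the share's root onto the ADLS Gen2 mount via FUSE.
for item in share.list_directories_and_files():
    if not item["is_directory"]:
        data = share.get_file_client(item["name"]).download_file().readall()
        with open(f"/dbfs/mnt/lake/incoming/{item['name']}", "wb") as f:
            f.write(data)
```
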
1 vote, 1 answer

Databricks - "Alter Table Owner to userid" is not working with Spark.sql in Pyspark notebook

I am trying to run the below command in Spark SQL in my PySpark notebook (Databricks) and it is getting an error, but the same command works in a SQL notebook. ALTER TABLE sales.product OWNER TO `john001@mycomp.com`; PySpark code below…
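
A sketch of the usual fix: the statement is identical in PySpark, it just has to go through spark.sql with the backticks kept inside the Python string and the trailing semicolon dropped (some runtimes reject it as extraneous input).

```python
spark.sql("ALTER TABLE sales.product OWNER TO `john001@mycomp.com`")
```
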