Questions tagged [databricks]

Databricks is a unified platform with tools for building, deploying, sharing, and maintaining enterprise-grade data and AI solutions at scale. The Databricks Lakehouse Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf. Databricks is available on AWS, Azure, and GCP. Use this tag for questions related to the Databricks Lakehouse Platform.

Use this tag for questions specific to the Databricks Lakehouse Platform, including, but not limited to, the Databricks File System (DBFS), REST APIs, Databricks Spark SQL extensions, and orchestration tools.

Don't use this tag for generic Apache Spark questions or for public Spark packages maintained by Databricks.


7135 questions
10 votes, 0 answers

Spark 2.4.0 - unable to parse ISO8601 string into TimestampType preserving ms

When trying to convert ISO8601 strings with time zone information into a TimestampType using cast(TimestampType), only strings using the time zone format +01:00 are accepted. If the time zone is defined in the ISO8601 legal way +0100 (without the…
Molotch
  • 365
  • 7
  • 20
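The usual workaround is to normalize the basic offset form (`+0100`) into the extended form (`+01:00`) before casting. A minimal stdlib sketch of that normalization (the helper name is hypothetical); Python's `%z` directive accepts both forms, which lets us check the two strings denote the same instant:

```python
import re
from datetime import datetime

def normalize_iso8601_offset(ts: str) -> str:
    """Insert a colon into a basic-format UTC offset (+0100 -> +01:00),
    the form Spark's TimestampType cast accepts."""
    return re.sub(r'([+-]\d{2})(\d{2})$', r'\1:\2', ts)

s = "2019-01-01T12:34:56.789+0100"
normalized = normalize_iso8601_offset(s)
print(normalized)  # 2019-01-01T12:34:56.789+01:00

# %z parses both forms, confirming the strings are equivalent instants:
fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
assert datetime.strptime(s, fmt) == datetime.strptime(normalized, fmt)
```

In Spark the same rewrite can be applied before the cast, e.g. with `F.regexp_replace("ts", r"([+-]\d{2})(\d{2})$", "$1:$2")` (assuming a string column named `ts`).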
10 votes, 2 answers

Pass variables from Scala to Python in Databricks

I'm using Databricks and trying to pass a dataframe from Scala to Python, within the same Scala notebook. I passed a dataframe from Python to Scala using: %python python_df.registerTempTable("temp_table") val scalaDF = table("temp_table") How do…
Ashley O
  • 1,130
  • 3
  • 21
  • 34
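Within a single Spark session, temporary views are visible from every language, so the same trick works in the opposite direction. A sketch assuming a Databricks notebook where the Scala cell has already run (`createOrReplaceTempView` is the non-deprecated replacement for `registerTempTable`; the view name is hypothetical):

```python
# %python cell, in the same notebook/session.
# Scala side (already executed): scalaDF.createOrReplaceTempView("shared_table")
python_df = spark.table("shared_table")  # "shared_table" is an example view name
```

This only runs inside Databricks (or any environment where both cells share one SparkSession).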
9 votes, 2 answers

How to configure Spark to adjust the number of output partitions after a join or groupby?

I know you can set spark.sql.shuffle.partitions and spark.sql.adaptive.advisoryPartitionSizeInBytes. The former will not work with adaptive query execution, and the latter only works for the first shuffle for some reason, after which it just uses…
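For reference, these are the Spark 3.x settings that interact here, shown as a hedged config sketch (values are illustrative, not recommendations); the advisory size only steers post-shuffle sizes when AQE's partition coalescing is enabled:

```python
# Sketch, assuming Spark 3.x with adaptive query execution available.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Initial post-shuffle parallelism before AQE coalesces partitions:
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "400")
# Target size AQE coalesces shuffle partitions toward:
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
```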
9 votes, 4 answers

List databricks secret scope and find referred keyvault in azure databricks

How can we find the existing secret scopes in a Databricks workspace? And which Key Vault is referenced by a specific secret scope in Azure Databricks?
tikiabbas
  • 119
  • 2
  • 3
  • 11
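A sketch of the usual approach, assuming the Databricks CLI is configured against the workspace; for Azure Key Vault-backed scopes the Secrets REST API reports the backing vault:

```shell
# List all secret scopes (name + backend type):
databricks secrets list-scopes

# The REST API additionally returns keyvault_metadata (resource_id, dns_name)
# for AZURE_KEYVAULT-backed scopes:
curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
     "$DATABRICKS_HOST/api/2.0/secrets/scopes/list"
```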
9 votes, 1 answer

Databricks Delta Live Tables: Difference between STREAMING and INCREMENTAL

Is there a difference between CREATE STREAMING LIVE TABLE and CREATE INCREMENTAL LIVE TABLE? The documentation is mixed: For instance, STREAMING is used here, while INCREMENTAL is used here. I have tested both and so far I have not noticed any…
dwolfeu
  • 1,103
  • 2
  • 14
  • 21
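The mixed documentation reflects a rename: `INCREMENTAL` was the earlier keyword, later replaced by `STREAMING`; as far as the docs indicate, both create the same kind of table. A sketch of the two spellings (table and source names are examples):

```sql
-- Current spelling
CREATE OR REFRESH STREAMING LIVE TABLE events_clean
AS SELECT * FROM STREAM(LIVE.events_raw);

-- Older, equivalent spelling seen in earlier docs
CREATE INCREMENTAL LIVE TABLE events_clean
AS SELECT * FROM STREAM(LIVE.events_raw);
```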
9 votes, 2 answers

Unable to make private java.nio.DirectByteBuffer(long,int) accessible

I'm using Python to access Databricks through databricks-connect. Under the hood this uses Spark, which is Java-based, so I need Java. The JDK has been downloaded (version 14) and set as the JAVA_HOME env variable, but when I run the…
anthino12
  • 770
  • 1
  • 6
  • 29
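This reflection failure is characteristic of running Spark, which was built against Java 8 internals, on a newer JDK; the databricks-connect releases in question expect Java 8 or 11, not 14. A sketch of the usual fix (the JDK path is an example):

```shell
# Point JAVA_HOME at a supported JDK before using databricks-connect:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example path
databricks-connect test
```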
9 votes, 2 answers

Delta table merge on multiple columns

I have a table whose primary key spans multiple columns, so I need to perform the merge logic on multiple columns: DeltaTable.forPath(spark, "path") .as("data") .merge( finalDf1.as("updates"), "data.column1 = updates.column1 AND…
Tony
  • 301
  • 3
  • 10
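The merge condition is just every key column ANDed together; a small helper that builds that condition string (the helper and the default aliases `data`/`updates` mirror the question's code but are otherwise hypothetical):

```python
def build_merge_condition(keys, target="data", source="updates"):
    """Build a Delta merge condition matching on every key column."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)

cond = build_merge_condition(["column1", "column2", "column3"])
print(cond)
# data.column1 = updates.column1 AND data.column2 = updates.column2 AND data.column3 = updates.column3
```

The resulting string can be passed directly as the condition argument of `.merge(...)`, exactly as in the snippet above.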
9 votes, 2 answers

Why is Spark creating multiple jobs for one action?

I noticed that when launching this piece of code with only one action, three jobs are launched. from typing import List from pyspark.sql import DataFrame from pyspark.sql.types import StructType, StructField, StringType from…
Nastasia
  • 557
  • 3
  • 22
9 votes, 2 answers

Why is Pandas UDF not being parallelized?

I have data from many IoT sensors. For each particular sensor there are only about 100 rows in the dataframe: the data is not skewed. I'm training an individual machine learning model for each sensor. I'm using a pandas UDF successfully to train and…
marcus
  • 91
  • 1
  • 5
9 votes, 1 answer

Processing upserts on a large number of partitions is not fast enough

The Problem We have a Delta Lake setup on top of ADLS Gen2 with the following tables: bronze.DeviceData: partitioned by arrival date (Partition_Date) silver.DeviceData: partitioned by event date and hour (Partition_Date and Partition_Hour) We…
9 votes, 2 answers

How do you get the run parameters and runId within Databricks notebook?

When running a Databricks notebook as a job, you can specify job or run parameters that can be used within the code of the notebook. However, it wasn't clear from the documentation how you actually fetch them. I'd like to be able to get all the…
Scott H
  • 2,644
  • 1
  • 24
  • 28
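The usual answer combines widgets (for named parameters) with the notebook context (for run metadata). A sketch that only runs inside a Databricks notebook, where `dbutils` is injected by the runtime; the context accessor is an internal, undocumented API and `"my_param"` is a hypothetical parameter name:

```python
# Named job/run parameters arrive as widgets:
my_param = dbutils.widgets.get("my_param")

# Run metadata (runId, jobId, ...) lives on the notebook context:
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
run_id = ctx.currentRunId().toString()   # internal API; may change across runtimes
print(ctx.toJson())                      # dumps everything, incl. jobId/runId tags
```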
9 votes, 2 answers

Azure Data Explorer (ADX) vs Polybase vs Databricks

Question Today I discovered another Azure service called Azure Data Explorer (ADX). Sorry for such a comparison of services; I have a good understanding of all of them except ADX. I feel like there is a big functionality overlap, so I want to know the exact role…
VB_
  • 45,112
  • 42
  • 145
  • 293
9 votes, 1 answer

What is the pyspark equivalent of MERGE INTO for databricks delta lake?

The databricks documentation describes how to do a merge for delta-tables. In SQL the syntax MERGE INTO [db_name.]target_table [AS target_alias] USING [db_name.]source_table [] [AS source_alias] ON [ WHEN…
Erik
  • 755
  • 1
  • 5
  • 17
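The Python equivalent lives in the `delta.tables` module. A sketch, assuming the delta-spark package is installed; the path, `updates_df`, and the join condition are examples:

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/events")   # example path
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")      # example condition
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```

`whenMatchedUpdateAll`/`whenNotMatchedInsertAll` mirror SQL's `WHEN MATCHED THEN UPDATE SET *` / `WHEN NOT MATCHED THEN INSERT *`.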
9 votes, 3 answers

PySpark: How can I suppress %run output in PySpark cell when importing variables from another Notebook?

I am using multiple notebooks in PySpark and import variables across these notebooks using %run path. Every time I run the command, all variables that I displayed in the original notebook are being displayed again in the current notebook (the…
DataBach
  • 1,330
  • 2
  • 16
  • 31
9 votes, 5 answers

Assign a variable a dynamic value in SQL in Databricks / Spark

I feel like I must be missing something obvious here, but I can't seem to dynamically set a variable value in Spark SQL. Let's say I have two tables, tableSrc and tableBuilder, and I'm creating tableDest. I've been trying variants on SET myVar…
Philip Kahn
  • 614
  • 1
  • 5
  • 22
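Spark SQL's `SET` stores literal text only; it does not evaluate subqueries, which is why the `SET myVar` variants fail. The common workaround computes the value in Python (or Scala) and substitutes it into the next statement. A sketch using the question's table names; the column name is an example:

```python
# Compute the value first, then inject it into the DDL:
max_val = spark.sql("SELECT MAX(some_col) AS v FROM tableSrc").first()["v"]
spark.sql(f"""
    CREATE TABLE tableDest AS
    SELECT * FROM tableBuilder
    WHERE some_col = '{max_val}'
""")
```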