Questions tagged [databricks]

Databricks is a unified platform with tools for building, deploying, sharing, and maintaining enterprise-grade data and AI solutions at scale. The Databricks Lakehouse Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf. Databricks is available on AWS, Azure, and GCP. Use this tag for questions related to the Databricks Lakehouse Platform.

Use this tag for questions specific to the Databricks Lakehouse Platform, including, but not limited to, the Databricks File System (DBFS), REST APIs, Databricks Spark SQL extensions, and orchestration tools.

Don't use this tag for generic questions about Apache Spark or about public Spark packages maintained by Databricks.


7135 questions
12 votes • 4 answers

databricks: check if the mountpoint already mounted

How can I check whether a mount point is already mounted before calling dbutils.fs.mount in Databricks Python? Thanks
mytabi • 639 • 2 • 12 • 28
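One common approach is to compare the target path against the list returned by dbutils.fs.mounts(). A minimal sketch — dbutils only exists inside a Databricks notebook, so the check is factored into a plain function; the mount source and path below are placeholders:

```python
def already_mounted(mount_point, mounts):
    """Return True if mount_point appears in `mounts`, where `mounts` is a
    sequence of objects with a `mountPoint` attribute (the shape returned
    by dbutils.fs.mounts())."""
    return any(getattr(m, "mountPoint", None) == mount_point for m in mounts)

# Inside a notebook (source and mount_point are placeholders):
# if not already_mounted("/mnt/raw", dbutils.fs.mounts()):
#     dbutils.fs.mount(source="wasbs://...", mount_point="/mnt/raw")
```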
12 votes • 2 answers

Saving Matplotlib Output to DBFS on Databricks

I'm writing Python code on Databricks to process some data and output graphs. I want to be able to save these graphs as a picture file (.png or something, the format doesn't really matter) to DBFS. Code: import pandas as pd import matplotlib.pyplot…
KikiNeko • 261 • 1 • 3 • 7
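A hedged sketch of one answer pattern: on Databricks, plain local-file APIs such as matplotlib's savefig can write to DBFS through the /dbfs FUSE mount, so the only step needed is translating a dbfs:/ URI into the corresponding local path (the file name below is a placeholder):

```python
def dbfs_to_local_path(dbfs_path):
    """Translate a DBFS URI (e.g. dbfs:/tmp/plot.png) into the /dbfs FUSE
    path that plain Python file APIs, including matplotlib's savefig, can
    write to directly on a Databricks cluster."""
    prefix = "dbfs:/"
    if dbfs_path.startswith(prefix):
        return "/dbfs/" + dbfs_path[len(prefix):]
    return dbfs_path  # already a local-style path

# Inside a notebook (figure and path are placeholders):
# fig.savefig(dbfs_to_local_path("dbfs:/FileStore/plots/output.png"))
```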
12 votes • 1 answer

Exporting spark dataframe to .csv with header and specific filename

I am trying to export data from a spark dataframe to .csv file: df.coalesce(1)\ .write\ .format("com.databricks.spark.csv")\ .option("header", "true")\ .save(output_path) It is creating a file name…
Naresh Y • 271 • 1 • 4 • 10
12 votes • 6 answers

Databricks display() function equivalent or alternative to Jupyter

I'm in the process of migrating current Databricks Spark notebooks to Jupyter notebooks. Databricks provides the convenient display(data_frame) function to visualize Spark dataframes and RDDs, but there's no direct equivalent…
Luis Leal • 3,388 • 5 • 26 • 49
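A rough stand-in, sketched under the assumption that Jupyter renders pandas DataFrames as HTML tables when they are a cell's last expression; the row cap keeps a large Spark DataFrame from being collected whole:

```python
def show_df(df, n=10):
    """Rough Jupyter substitute for Databricks' display(): collect the
    first n rows of a Spark DataFrame and return them as a pandas
    DataFrame, which Jupyter renders as an HTML table."""
    return df.limit(n).toPandas()
```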
12 votes • 2 answers

Adding constant value column to spark dataframe

I am using Spark version 2.1 in Databricks. I have a data frame named wamp to which I want to add a column named region which should take the constant value NE. However, I get an error saying NameError: name 'lit' is not defined when I run the…
Gaurav Bansal • 5,221 • 14 • 45 • 91
12 votes • 1 answer

Spark dataframe save in single file on hdfs location

I have a dataframe and I want to save it as a single file on an HDFS location. I found the solution here: Write single CSV file using spark-csv. df.coalesce(1) .write.format("com.databricks.spark.csv") .option("header", "true") …
shikha dubey • 139 • 1 • 1 • 5
12 votes • 1 answer

How can I convert a pyspark.sql.dataframe.DataFrame back to a sql table in databricks notebook

I created a dataframe of type pyspark.sql.dataframe.DataFrame by executing the following line: dataframe = sqlContext.sql("select * from my_data_table") How can I convert this back to a Spark SQL table that I can run SQL queries on?
Semihcan Doken • 776 • 3 • 10 • 23
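The reverse direction is a single call: register the DataFrame as a temporary view and query it with spark.sql. Sketched as a small wrapper so the registration step is explicit (the view name is a placeholder):

```python
def register_for_sql(dataframe, view_name):
    """Expose a Spark DataFrame to SQL by registering it as a temporary
    view. createOrReplaceTempView is the Spark 2.x+ name; older code
    used registerTempTable. Afterwards the view is queryable with
    spark.sql("select * from <view_name>")."""
    dataframe.createOrReplaceTempView(view_name)
    return view_name
```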
11 votes • 2 answers

What are the major differences between S3 lake formation governed tables and databricks delta tables?

What are the major differences between S3 Lake Formation governed tables and Databricks Delta tables? They look pretty similar.
MGomez • 123 • 1 • 5
11 votes • 1 answer

Switching between Databricks Connect and local Spark environment

I am looking to use Databricks Connect for developing a pyspark pipeline. DBConnect is really awesome because I am able to run my code on the cluster where the actual data resides, so it's perfect for integration testing, but I also want to be able…
casparjespersen • 3,460 • 5 • 38 • 63
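One way to switch is a small environment toggle. A sketch under stated assumptions: USE_DBCONNECT is a hypothetical flag, and with Databricks Connect installed a plain SparkSession.builder.getOrCreate() already targets the remote cluster, so a local master is only forced when the flag is off:

```python
import os

def spark_master(env=None):
    """Decide where a pipeline should run. USE_DBCONNECT is a hypothetical
    flag: when it is set, the builder is left alone so Databricks Connect
    routes the session to the remote cluster; otherwise a local master is
    forced for fast unit tests."""
    env = os.environ if env is None else env
    return None if env.get("USE_DBCONNECT") == "1" else "local[*]"

# Typical use (builder lines belong in the pipeline entry point):
# builder = SparkSession.builder
# master = spark_master()
# if master:
#     builder = builder.master(master)
# spark = builder.getOrCreate()
```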
11 votes • 6 answers

What is the correct way to install the delta module in python?

What is the correct way to install the delta module in Python? In the example they import the module with from delta.tables import *, but I did not find the correct way to install the module in my virtual env. Currently I am using this spark param…
ofriman • 198 • 1 • 1 • 9
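For a virtual env, the delta-spark pip package together with its documented builder helper is the usual route. A setup sketch — the app name is a placeholder, and the delta-spark version must match the installed pyspark version:

```python
# pip install delta-spark   (version must be compatible with local pyspark)
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("delta-local")  # placeholder name
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

from delta.tables import *  # now resolves
```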
11 votes • 2 answers

Pass additional arguments to foreachBatch in pyspark

I am using foreachBatch in pyspark structured streaming to write each microbatch to SQL Server using JDBC. I need to use the same process for several tables, and I'd like to reuse the same writer function by adding an additional argument for table…
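A common pattern is to bind the extra argument with functools.partial, since foreachBatch itself only supplies (batch_df, batch_id). A sketch with placeholder JDBC options:

```python
from functools import partial

def write_batch(batch_df, batch_id, table_name):
    """Writer for foreachBatch; table_name is the extra argument that
    foreachBatch will not supply (JDBC URL and options are placeholders)."""
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://...")  # placeholder
        .option("dbtable", table_name)
        .mode("append")
        .save())

# One streaming query per table, reusing the same writer function:
# df.writeStream.foreachBatch(partial(write_batch, table_name="dbo.events")).start()
```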
11 votes • 1 answer

Error running Spark on Databricks: constructor public XXX is not whitelisted

I was using Azure Databricks and trying to run some example python code from this page. But I get this exception: py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.classification.LogisticRegression(java.lang.String) is not…
lidong • 556 • 1 • 4 • 20
11 votes • 2 answers

Difference in usecases for AWS Sagemaker vs Databricks?

I was looking at Databricks because it integrates with AWS services like Kinesis, but it looks to me like SageMaker is a direct competitor to Databricks. We are heavily using AWS; is there any reason to add Databricks into the stack, or does…
L Xandor • 1,659 • 4 • 24 • 48
11 votes • 2 answers

Unsupported literal type class scala.runtime.BoxedUnit

I am trying to filter a column of a dataframe read from Oracle, as below: import org.apache.spark.sql.functions.{col, lit, when} val df0 = df_org.filter(col("fiscal_year").isNotNull()) When I do it I am getting the below…
BdEngineer • 2,929 • 4 • 49 • 85
11 votes • 3 answers

Create a new cluster in Databricks using databricks-cli

I'm trying to create a new cluster in Databricks on Azure using databricks-cli. I'm using the following command: databricks clusters create --json '{ "cluster_name": "template2", "spark_version": "4.1.x-scala2.11" }' And getting back this…
Mor Shemesh • 2,689 • 1 • 24 • 36