Questions tagged [databricks]

Databricks is a unified platform with tools for building, deploying, sharing, and maintaining enterprise-grade data and AI solutions at scale. The Databricks Lakehouse Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf. Databricks is available on AWS, Azure, and GCP. Use this tag for questions related to the Databricks Lakehouse Platform.

Use this tag for questions specific to the Databricks Lakehouse Platform, including, but not limited to, the Databricks File System (DBFS), REST APIs, Databricks Spark SQL extensions, and orchestration tools.

Don't use this tag for generic Apache Spark questions or for public Spark packages maintained by Databricks.

7135 questions
9 votes, 1 answer

How to plot correlation heatmap when using pyspark+databricks

I am studying pyspark in databricks. I want to generate a correlation heatmap. Let's say this is my data: myGraph=spark.createDataFrame([(1.3,2.1,3.0), (2.5,4.6,3.1), (6.5,7.2,10.0)], …
Feng Chen • 2,139 • 4 • 33 • 62
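One way to approach this, as a rough sketch assuming hypothetical column names ("f1", "f2", "f3"): compute the Pearson correlation matrix with pyspark.ml.stat.Correlation, pull the small matrix to the driver, and plot it with seaborn.

```python
# Sketch: correlation heatmap from a PySpark DataFrame (column names are assumptions).
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
import matplotlib.pyplot as plt
import seaborn as sns

myGraph = spark.createDataFrame(
    [(1.3, 2.1, 3.0), (2.5, 4.6, 3.1), (6.5, 7.2, 10.0)],
    ["f1", "f2", "f3"],
)

# Correlation expects a single vector column, so assemble the numeric columns first.
vec = VectorAssembler(inputCols=myGraph.columns, outputCol="features").transform(myGraph)

# Returns a one-row DataFrame whose first cell holds the correlation matrix.
corr = Correlation.corr(vec, "features").head()[0].toArray()

fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, xticklabels=myGraph.columns, yticklabels=myGraph.columns, ax=ax)
plt.show()  # in a Databricks notebook, display(fig) also works
```

Only the correlation computation stays distributed; the resulting matrix is tiny (columns by columns), so the plotting is ordinary matplotlib on the driver.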
9 votes, 1 answer

How to pass Python variables to a shell script in an Azure Databricks notebook?

How can I pass Python variables from a %python cell to a %sh shell script cell in an Azure Databricks notebook?
Tony • 142 • 2 • 8
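Because a %sh cell runs in a separate shell process, Python variables are not visible to it directly; one common workaround (a sketch, with a hypothetical value and file path) is to write the value to a driver-local file in the Python cell and read it back in the shell cell.

```python
# Cell 1 (%python): hand the value over through a file on the driver's local disk.
my_value = "2024-01-31"  # hypothetical variable to pass
with open("/tmp/my_value.txt", "w") as f:
    f.write(my_value)

# Cell 2 (%sh) would then read it back, for example:
# %sh
# MY_VALUE=$(cat /tmp/my_value.txt)
# echo "Got value: $MY_VALUE"
```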
9 votes, 1 answer

Databricks-GitHub integration, automatically add all notebooks to repository

I'm trying to set up GitHub integration for Databricks. We have hundreds of notebooks there, and it would be exhausting to add every notebook manually to the repo. Is there some way to automatically commit and push all notebooks from databricks to…
Viacheslav Shalamov • 4,149 • 6 • 44 • 66
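One scriptable route (a sketch assuming the legacy databricks-cli and a local clone of the GitHub repository are available; paths and the commit message are placeholders): export the whole workspace folder with `databricks workspace export_dir`, then commit and push whatever changed.

```python
# Sketch: bulk-export notebooks and push them to Git (paths are hypothetical).
import subprocess

WORKSPACE_PATH = "/Users/someone@example.com"   # workspace folder holding the notebooks
LOCAL_REPO = "/tmp/notebooks-repo"              # local clone of the GitHub repository

# Export notebooks as source files, overwriting any stale local copies.
subprocess.run(
    ["databricks", "workspace", "export_dir", WORKSPACE_PATH, LOCAL_REPO, "--overwrite"],
    check=True,
)

subprocess.run(["git", "-C", LOCAL_REPO, "add", "--all"], check=True)
subprocess.run(["git", "-C", LOCAL_REPO, "commit", "-m", "Sync notebooks from Databricks"], check=True)
subprocess.run(["git", "-C", LOCAL_REPO, "push"], check=True)
```

Note that `git commit` exits non-zero when there is nothing to commit, so a real sync job would check for changes before committing.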
9 votes, 3 answers

In Databricks, check whether a path exists or not

I am reading CSV files from a Data Lake Store using multiple paths, but if any one of the paths does not exist an exception is thrown. I want to avoid this exception.
Bilal Shafqat • 689 • 2 • 14 • 26
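A common pattern for this (a sketch with placeholder mount paths): probe each path with dbutils.fs.ls inside a try/except and only pass the paths that exist to the reader.

```python
def path_exists(path: str) -> bool:
    """Return True if the DBFS/mounted path exists, False if it does not."""
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if "java.io.FileNotFoundException" in str(e):
            return False
        raise  # re-raise anything that is not a plain "not found"

candidate_paths = ["/mnt/datalake/sales/2020/", "/mnt/datalake/sales/2021/"]  # placeholders
existing = [p for p in candidate_paths if path_exists(p)]

# spark.read.csv accepts a list of paths, so missing ones are simply filtered out first.
df = spark.read.csv(existing, header=True) if existing else None
```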
9 votes, 1 answer

Azure Databricks vs ADLA for processing

Presently, I have all my data files in Azure Data Lake Store. I need to process these files, which are mostly in CSV format. The processing would be running jobs on these files to extract various information, e.g. data for certain date ranges…
Jobi • 93 • 1 • 3
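For the Databricks side of such a comparison, the workload itself is straightforward; a sketch (mount point, column names, and dates are all assumptions) of reading the CSVs from the lake and extracting a date range:

```python
from pyspark.sql import functions as F

# Read the raw CSV files straight from the mounted Data Lake Store.
df = spark.read.csv("/mnt/adls/raw/*.csv", header=True, inferSchema=True)

# Keep one quarter of data and aggregate per day.
report = (
    df.withColumn("event_date", F.to_date("event_date"))
      .filter(F.col("event_date").between("2019-01-01", "2019-03-31"))
      .groupBy("event_date")
      .count()
)
report.write.mode("overwrite").parquet("/mnt/adls/processed/q1_counts")
```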
9 votes, 3 answers

Spark Parallelism in Standalone Mode

I'm trying to run Spark in standalone mode on my system. The current specification of my system is 8 cores and 32 GB of memory. Based on this article, I calculate the Spark configuration as the following: spark.driver.memory 2g spark.executor.cores…
Beta • 1,638 • 5 • 33 • 67
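As a hedged sizing sketch for a single 8-core / 32 GB standalone node (the numbers follow the usual "reserve about one core and one GB for the OS" rule of thumb and are not a definitive answer):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")          # standalone master URL (assumed)
    .appName("sizing-example")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.cores", "7")       # 8 cores minus 1 reserved for the OS
    .config("spark.executor.memory", "26g")    # leaves room for OS, driver and overhead
    .config("spark.executor.memoryOverhead", "3g")
    .getOrCreate()
)
```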
8 votes, 3 answers

Databricks - Pyspark vs Pandas

I have a python script where I'm using pandas for transformations/manipulation of my data. I know I have some "inefficient" blocks of code. My question is, if pyspark is supposed to be much faster, can I just replace these blocks using pyspark…
chicagobeast12 • 643 • 1 • 5 • 20
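As an illustration of what such a replacement looks like (hypothetical column names), here is the same "filter then aggregate" block in pandas and in PySpark; the PySpark version only pays off once the data is too large for a single machine.

```python
import pandas as pd
from pyspark.sql import functions as F

# pandas version
pdf = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 30]})
pandas_result = pdf[pdf["sales"] > 15].groupby("store", as_index=False)["sales"].sum()

# Equivalent PySpark block
sdf = spark.createDataFrame(pdf)
spark_result = (
    sdf.filter(F.col("sales") > 15)
       .groupBy("store")
       .agg(F.sum("sales").alias("sales"))
)
spark_result.show()
```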
8 votes, 0 answers

Training multiple word embedding models with PySpark getting stuck

Very excited to finally post my first question, but please nudge me if I am unclear or violate the standard etiquette. I sincerely appreciate any help that I can get. I am attempting to use PySpark (in Databricks) to train embeddings for many…
rcarroll901 • 141 • 8
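For reference, a minimal single-model Word2Vec sketch with pyspark.ml.feature (toy data only; it illustrates the API rather than diagnosing why many models trained in a loop get stuck):

```python
from pyspark.ml.feature import Word2Vec

docs = spark.createDataFrame(
    [(["hello", "spark", "world"],), (["databricks", "trains", "embeddings"],)],
    ["text"],
)

w2v = Word2Vec(vectorSize=50, minCount=1, inputCol="text", outputCol="vectors")
model = w2v.fit(docs)
model.getVectors().show(truncate=False)  # one embedding row per vocabulary word
```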
8 votes, 1 answer

How can I use databricks utils functions in PyCharm? I can't find an appropriate pip package

PyCharm IDE. I want to use dbutils.widgets.get() in a module and then import this module into Databricks. I already tried pip install databricks-client, pip install databricks-utils and pip install DBUtils
Borislav Blagoev • 187 • 5 • 15
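There is no pip package that ships the notebook's dbutils; with databricks-connect the usual pattern (a sketch; only some utilities such as fs and secrets are available this way, widgets generally are not) is to construct it from the SparkSession via pyspark.dbutils:

```python
from pyspark.sql import SparkSession

def get_dbutils(spark: SparkSession):
    try:
        from pyspark.dbutils import DBUtils   # available with databricks-connect
        return DBUtils(spark)
    except ImportError:
        import IPython                        # running inside a Databricks notebook
        return IPython.get_ipython().user_ns["dbutils"]

spark = SparkSession.builder.getOrCreate()
dbutils = get_dbutils(spark)
print(dbutils.fs.ls("/"))
```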
8 votes, 2 answers

High Concurrency Clusters in Databricks

This from Databricks docs: High Concurrency clusters A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource…
thebluephantom • 16,458 • 8 • 40 • 83
8 votes, 1 answer

Running into 'java.lang.OutOfMemoryError: Java heap space' when using toPandas() and databricks connect

I'm trying to transform a pyspark dataframe of size [2734984 rows x 11 columns] to a pandas dataframe calling toPandas(). Whereas it is working totally fine (11 seconds) when using an Azure Databricks Notebook, I run into a…
petzholt • 113 • 1 • 8
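The usual mitigations (sketched below with placeholder sizes, and assuming the settings can still be applied when the session is created; on an existing cluster they belong in the cluster's Spark config instead) are to give the collecting driver JVM more room, raise the result-size cap, enable Arrow, or avoid toPandas() entirely.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")          # heap of the JVM doing the collect
    .config("spark.driver.maxResultSize", "4g")   # cap on the collected result
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")  # leaner row transfer
    .getOrCreate()
)

df = spark.range(0, 2_734_984)   # stand-in for the real 2.7M-row DataFrame

# Option 1: collect with the larger heap and Arrow-based conversion.
pdf = df.toPandas()

# Option 2: skip toPandas() entirely and write the result out instead,
# then read it with pandas wherever it is actually needed.
df.write.mode("overwrite").parquet("/tmp/result.parquet")
```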
8 votes, 1 answer

Databricks CLI: getting b'Bad Request' error

I am trying to use the Databricks CLI for the first time. Whenever I run a CLI command it gives me the message "Error: b'Bad Request'". This is the same for any CLI-based command. I am able to authenticate (tried with a wrong token and got the…
8 votes, 3 answers

Databricks CLI: SSLError, can't find local issuer certificate

I have installed and configured the Databricks CLI, but when I try using it I get an error indicating that it can't find a local issuer certificate: $ dbfs ls dbfs:/databricks/cluster_init/ Error: SSLError:…
James Adams • 8,448 • 21 • 89 • 148
8 votes, 2 answers

Spark schema management in a single place

Question: What is the best way to manage Spark tables' schemas? Do you see any drawbacks of Option 2? Can you suggest any better alternatives? Solutions I see: Option 1: keep separate definitions for code and for the metastore. The drawback of this is…
VB_ • 45,112 • 42 • 145 • 293
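One way to keep the schema in a single place (a sketch with illustrative table and column names): define the StructType once in a shared module and derive both the reader schema and the metastore DDL from it.

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

EVENTS_SCHEMA = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("event_date", DateType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# The same definition drives file reads...
df = spark.read.schema(EVENTS_SCHEMA).csv("/mnt/raw/events/", header=True)

# ...and the metastore table registration.
ddl_columns = ", ".join(f"{f.name} {f.dataType.simpleString()}" for f in EVENTS_SCHEMA.fields)
spark.sql(f"CREATE TABLE IF NOT EXISTS events ({ddl_columns}) USING DELTA")
```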
8 votes, 3 answers

Python Version in Azure Databricks

I am trying to find out the Python version I am using in Databricks. To find out, I tried: import sys; print(sys.version), and I got the output 3.7.3. However, when I go to Cluster --> Spark UI --> Environment, I see that the cluster Python version…
learner • 833 • 3 • 13 • 24
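The two numbers can legitimately differ because sys.version in a notebook reports the driver's Python, while the Spark UI Environment tab reflects the cluster configuration; a small sketch that checks both from code makes the comparison explicit.

```python
import sys

print("Driver Python:", sys.version.split()[0])

# Ask an executor for its Python version via a one-element Spark job.
def worker_python(_):
    import sys as worker_sys
    return worker_sys.version.split()[0]

executor_version = sc.parallelize([0], 1).map(worker_python).collect()[0]
print("Executor Python:", executor_version)
```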