Questions tagged [databricks]

Databricks is a unified platform with tools for building, deploying, sharing, and maintaining enterprise-grade data and AI solutions at scale. The Databricks Lakehouse Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf. Databricks is available on AWS, Azure, and GCP. Use this tag for questions related to the Databricks Lakehouse Platform.

Use this tag for questions specific to the Databricks Lakehouse Platform, including, but not limited to, the Databricks File System (DBFS), REST APIs, Databricks Spark SQL extensions, and orchestration tools.

Don't use this tag for generic Apache Spark questions or for public Spark packages maintained by Databricks.

7135 questions
9 votes, 1 answer

How to plot correlation heatmap when using pyspark+databricks

I am studying pyspark in databricks. I want to generate a correlation heatmap. Let's say this is my data: myGraph=spark.createDataFrame([(1.3,2.1,3.0), (2.5,4.6,3.1), (6.5,7.2,10.0)], …
Feng Chen • 2,139 • 4 • 33 • 62
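One way to approach this, as a rough sketch assuming hypothetical column names ("f1", "f2", "f3"): compute the Pearson correlation matrix with pyspark.ml.stat.Correlation, pull the small matrix to the driver, and plot it with seaborn.

```python
# Sketch: correlation heatmap from a PySpark DataFrame (column names are assumptions).
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
import matplotlib.pyplot as plt
import seaborn as sns

myGraph = spark.createDataFrame(
    [(1.3, 2.1, 3.0), (2.5, 4.6, 3.1), (6.5, 7.2, 10.0)],
    ["f1", "f2", "f3"],
)

# Correlation expects a single vector column, so assemble the numeric columns first.
vec = VectorAssembler(inputCols=myGraph.columns, outputCol="features").transform(myGraph)

# Returns a one-row DataFrame whose first cell holds the correlation matrix.
corr = Correlation.corr(vec, "features").head()[0].toArray()

fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, xticklabels=myGraph.columns, yticklabels=myGraph.columns, ax=ax)
plt.show()  # in a Databricks notebook, display(fig) also works
```

Only the correlation computation stays distributed; the resulting matrix is tiny (columns by columns), so the plotting is ordinary matplotlib on the driver.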
9 votes, 1 answer

How to pass Python variables to a shell script in an Azure Databricks notebook?

How can I pass Python variables from a %python cell to a %sh shell script cell in an Azure Databricks notebook?
Tony • 142 • 2 • 8
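Because a %sh cell runs in a separate shell process, Python variables are not visible to it directly; one common workaround (a sketch, with a hypothetical value and file path) is to write the value to a driver-local file in the Python cell and read it back in the shell cell.

```python
# Cell 1 (%python): hand the value over through a file on the driver's local disk.
my_value = "2024-01-31"  # hypothetical variable to pass
with open("/tmp/my_value.txt", "w") as f:
    f.write(my_value)

# Cell 2 (%sh) would then read it back, for example:
# %sh
# MY_VALUE=$(cat /tmp/my_value.txt)
# echo "Got value: $MY_VALUE"
```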
9 votes, 1 answer

Databricks-GitHub integration, automatically add all notebooks to repository

I'm trying to set up GitHub integration for Databricks. We have hundreds of notebooks there, and it would be exhausting to add every notebook manually to the repo. Is there some way to automatically commit and push all notebooks from databricks to…
Viacheslav Shalamov • 4,149 • 6 • 44 • 66
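One scriptable route (a sketch assuming the legacy databricks-cli and a local clone of the GitHub repository are available; paths and the commit message are placeholders): export the whole workspace folder with `databricks workspace export_dir`, then commit and push whatever changed.

```python
# Sketch: bulk-export notebooks and push them to Git (paths are hypothetical).
import subprocess

WORKSPACE_PATH = "/Users/someone@example.com"   # workspace folder holding the notebooks
LOCAL_REPO = "/tmp/notebooks-repo"              # local clone of the GitHub repository

# Export notebooks as source files, overwriting any stale local copies.
subprocess.run(
    ["databricks", "workspace", "export_dir", WORKSPACE_PATH, LOCAL_REPO, "--overwrite"],
    check=True,
)

subprocess.run(["git", "-C", LOCAL_REPO, "add", "--all"], check=True)
subprocess.run(["git", "-C", LOCAL_REPO, "commit", "-m", "Sync notebooks from Databricks"], check=True)
subprocess.run(["git", "-C", LOCAL_REPO, "push"], check=True)
```

Note that `git commit` exits non-zero when there is nothing to commit, so a real sync job would check for changes before committing.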
9 votes, 3 answers

In Databricks, check whether a path exists or not

I am reading CSV files from a Data Lake Store using multiple paths, but if any one of the paths does not exist an exception is thrown. I want to avoid this exception.
Bilal Shafqat • 689 • 2 • 14 • 26
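A common pattern for this (a sketch with placeholder mount paths): probe each path with dbutils.fs.ls inside a try/except and only pass the paths that exist to the reader.

```python
def path_exists(path: str) -> bool:
    """Return True if the DBFS/mounted path exists, False if it does not."""
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if "java.io.FileNotFoundException" in str(e):
            return False
        raise  # re-raise anything that is not a plain "not found"

candidate_paths = ["/mnt/datalake/sales/2020/", "/mnt/datalake/sales/2021/"]  # placeholders
existing = [p for p in candidate_paths if path_exists(p)]

# spark.read.csv accepts a list of paths, so missing ones are simply filtered out first.
df = spark.read.csv(existing, header=True) if existing else None
```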
9 votes, 1 answer

Azure Databricks vs ADLA for processing

Presently, I have all my data files in Azure Data Lake Store. I need to process these files, which are mostly in CSV format. The processing would be running jobs on these files to extract various information, e.g. data for certain date ranges…
Jobi • 93 • 1 • 3
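For the Databricks side of such a comparison, the workload itself is straightforward; a sketch (mount point, column names, and dates are all assumptions) of reading the CSVs from the lake and extracting a date range:

```python
from pyspark.sql import functions as F

# Read the raw CSV files straight from the mounted Data Lake Store.
df = spark.read.csv("/mnt/adls/raw/*.csv", header=True, inferSchema=True)

# Keep one quarter of data and aggregate per day.
report = (
    df.withColumn("event_date", F.to_date("event_date"))
      .filter(F.col("event_date").between("2019-01-01", "2019-03-31"))
      .groupBy("event_date")
      .count()
)
report.write.mode("overwrite").parquet("/mnt/adls/processed/q1_counts")
```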
9 votes, 3 answers

Spark Parallelism in Standalone Mode

I'm trying to run Spark in standalone mode on my system. The current specification of my system is 8 cores and 32 GB of memory. Based on this article, I calculate the Spark configuration as the following: spark.driver.memory 2g spark.executor.cores…
Beta • 1,638 • 5 • 33 • 67
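As a hedged sizing sketch for a single 8-core / 32 GB standalone node (the numbers follow the usual "reserve about one core and one GB for the OS" rule of thumb and are not a definitive answer):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")          # standalone master URL (assumed)
    .appName("sizing-example")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.cores", "7")       # 8 cores minus 1 reserved for the OS
    .config("spark.executor.memory", "26g")    # leaves room for OS, driver and overhead
    .config("spark.executor.memoryOverhead", "3g")
    .getOrCreate()
)
```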
8 votes, 3 answers

Databricks - Pyspark vs Pandas

I have a python script where I'm using pandas for transformations/manipulation of my data. I know I have some "inefficient" blocks of code. My question is, if pyspark is supposed to be much faster, can I just replace these blocks using pyspark…
chicagobeast12 • 643 • 1 • 5 • 20
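As an illustration of what such a replacement looks like (hypothetical column names), here is the same "filter then aggregate" block in pandas and in PySpark; the PySpark version only pays off once the data is too large for a single machine.

```python
import pandas as pd
from pyspark.sql import functions as F

# pandas version
pdf = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 30]})
pandas_result = pdf[pdf["sales"] > 15].groupby("store", as_index=False)["sales"].sum()

# Equivalent PySpark block
sdf = spark.createDataFrame(pdf)
spark_result = (
    sdf.filter(F.col("sales") > 15)
       .groupBy("store")
       .agg(F.sum("sales").alias("sales"))
)
spark_result.show()
```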
8 votes, 0 answers

Training multiple word embedding models with PySpark getting stuck

Very excited to finally post my first question, but please nudge me if I am unclear or violate the standard etiquette. I sincerely appreciate any help that I can get. I am attempting to use PySpark (in Databricks) to train embeddings for many…
rcarroll901 • 141 • 8
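For reference, a minimal single-model Word2Vec sketch with pyspark.ml.feature (toy data only; it illustrates the API rather than diagnosing why many models trained in a loop get stuck):

```python
from pyspark.ml.feature import Word2Vec

docs = spark.createDataFrame(
    [(["hello", "spark", "world"],), (["databricks", "trains", "embeddings"],)],
    ["text"],
)

w2v = Word2Vec(vectorSize=50, minCount=1, inputCol="text", outputCol="vectors")
model = w2v.fit(docs)
model.getVectors().show(truncate=False)  # one embedding row per vocabulary word
```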
8 votes, 1 answer

How can I use databricks utils functions in PyCharm? I can't find an appropriate pip package

PyCharm IDE. I want to use dbutils.widgets.get() in a module and then import this module into Databricks. I already tried pip install databricks-client, pip install databricks-utils and pip install DBUtils
Borislav Blagoev • 187 • 5 • 15
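There is no pip package that ships the notebook's dbutils; with databricks-connect the usual pattern (a sketch; only some utilities such as fs and secrets are available this way, widgets generally are not) is to construct it from the SparkSession via pyspark.dbutils:

```python
from pyspark.sql import SparkSession

def get_dbutils(spark: SparkSession):
    try:
        from pyspark.dbutils import DBUtils   # available with databricks-connect
        return DBUtils(spark)
    except ImportError:
        import IPython                        # running inside a Databricks notebook
        return IPython.get_ipython().user_ns["dbutils"]

spark = SparkSession.builder.getOrCreate()
dbutils = get_dbutils(spark)
print(dbutils.fs.ls("/"))
```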
8 votes, 2 answers

High Concurrency Clusters in Databricks

This from Databricks docs: High Concurrency clusters A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource…
thebluephantom • 16,458 • 8 • 40 • 83
8 votes, 1 answer

Running into 'java.lang.OutOfMemoryError: Java heap space' when using toPandas() and databricks connect

I'm trying to transform a pyspark dataframe of size [2734984 rows x 11 columns] to a pandas dataframe calling toPandas(). Whereas it is working totally fine (11 seconds) when using an Azure Databricks Notebook, I run into a…
petzholt • 113 • 1 • 8
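The usual mitigations (sketched below with placeholder sizes, and assuming the settings can still be applied when the session is created; on an existing cluster they belong in the cluster's Spark config instead) are to give the collecting driver JVM more room, raise the result-size cap, enable Arrow, or avoid toPandas() entirely.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")          # heap of the JVM doing the collect
    .config("spark.driver.maxResultSize", "4g")   # cap on the collected result
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")  # leaner row transfer
    .getOrCreate()
)

df = spark.range(0, 2_734_984)   # stand-in for the real 2.7M-row DataFrame

# Option 1: collect with the larger heap and Arrow-based conversion.
pdf = df.toPandas()

# Option 2: skip toPandas() entirely and write the result out instead,
# then read it with pandas wherever it is actually needed.
df.write.mode("overwrite").parquet("/tmp/result.parquet")
```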
8 votes, 1 answer

Databricks CLI: getting b'Bad Request' error

I am trying to use the Databricks CLI for the first time. Whenever I run a CLI command it gives me the message "Error: b'Bad Request'". This is the same for any CLI-based command. I am able to authenticate (tried with a wrong token and got the…
8 votes, 3 answers

Databricks CLI: SSLError, can't find local issuer certificate

I have installed and configured the Databricks CLI, but when I try using it I get an error indicating that it can't find a local issuer certificate: $ dbfs ls dbfs:/databricks/cluster_init/ Error: SSLError:…
James Adams • 8,448 • 21 • 89 • 148
8 votes, 2 answers

Spark schema management in a single place

Question: What is the best way to manage Spark tables' schemas? Do you see any drawbacks of Option 2? Can you suggest any better alternatives? Solutions I see: Option 1: keep separate definitions for code and for the metastore. The drawback of this is…
VB_ • 45,112 • 42 • 145 • 293
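One way to keep the schema in a single place (a sketch with illustrative table and column names): define the StructType once in a shared module and derive both the reader schema and the metastore DDL from it.

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

EVENTS_SCHEMA = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("event_date", DateType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# The same definition drives file reads...
df = spark.read.schema(EVENTS_SCHEMA).csv("/mnt/raw/events/", header=True)

# ...and the metastore table registration.
ddl_columns = ", ".join(f"{f.name} {f.dataType.simpleString()}" for f in EVENTS_SCHEMA.fields)
spark.sql(f"CREATE TABLE IF NOT EXISTS events ({ddl_columns}) USING DELTA")
```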
8 votes, 3 answers

Python Version in Azure Databricks

I am trying to find out the Python version I am using in Databricks. To find out, I tried: import sys; print(sys.version), and I got the output 3.7.3. However, when I go to Cluster --> Spark UI --> Environment, I see that the cluster Python version…
learner • 833 • 3 • 13 • 24
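The two numbers can legitimately differ because sys.version in a notebook reports the driver's Python, while the Spark UI Environment tab reflects the cluster configuration; a small sketch that checks both from code makes the comparison explicit.

```python
import sys

print("Driver Python:", sys.version.split()[0])

# Ask an executor for its Python version via a one-element Spark job.
def worker_python(_):
    import sys as worker_sys
    return worker_sys.version.split()[0]

executor_version = sc.parallelize([0], 1).map(worker_python).collect()[0]
print("Executor Python:", executor_version)
```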