Questions tagged [databricks]

Databricks is a unified platform with tools for building, deploying, sharing, and maintaining enterprise-grade data and AI solutions at scale. The Databricks Lakehouse Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf. Databricks is available on AWS, Azure, and GCP. Use this tag for questions related to the Databricks Lakehouse Platform.

Use this tag for questions specific to the Databricks Lakehouse Platform, including, but not limited to, the Databricks File System (DBFS), REST APIs, Databricks Spark SQL extensions, and orchestration tools.

Don't use this tag for generic Apache Spark questions or for public Spark packages maintained by Databricks.


7135 questions
2
votes
1 answer

Error logging in python not working with azure databricks

A related question about this problem was never answered. I tried implementing error logging using Python in Azure Databricks. If I try the code below in Python (PyCharm), it works as expected, but when I try the same code in Azure…
Lucky
  • 21
  • 1
  • 3
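On managed runtimes such as Databricks, a frequent cause of this symptom is that the platform has already attached handlers to the root logger, so `logging.basicConfig` silently does nothing. A minimal sketch of a workaround using a dedicated named logger (the logger name is an assumption):

```python
import logging
import sys

# basicConfig() is a no-op if handlers are already attached to the root
# logger, which is common on managed platforms. Configuring a dedicated
# named logger with its own handler avoids that problem.
logger = logging.getLogger("my_job")          # hypothetical logger name
logger.setLevel(logging.ERROR)
logger.propagate = False                      # don't duplicate via the root logger

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

try:
    1 / 0
except ZeroDivisionError:
    logger.exception("division failed")       # logs the traceback at ERROR level
```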
2
votes
3 answers

Unable to Create Extract - Tableau and Spark SQL

I am trying to extract information from Spark SQL. The following error message appears while creating the extract. [Simba][Hardy] (35) Error from server: error code: '0' error message: 'org.apache.spark.SparkException: Job aborted due to stage…
Niks
  • 997
  • 2
  • 10
  • 28
2
votes
2 answers

Databricks not updating in SQL query

I am trying to replace special characters in a table column using a SQL query. However, I get the following error. Can anyone tell me what I did wrong or how I should approach this? SQL QUERY UPDATE wine SET description = REPLACE(description,…
ApplePie
  • 1,155
  • 3
  • 12
  • 30
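For context, `UPDATE` in Databricks SQL is supported only on Delta tables, and `REPLACE` substitutes one literal string at a time. A sketch of the same cleanup in plain Python with a regex; the character class is an assumption about which characters count as "special":

```python
import re

# Strip everything except letters, digits, and spaces. Adjust the
# character class to match whichever "special characters" you mean.
def clean(description: str) -> str:
    return re.sub(r"[^A-Za-z0-9 ]", "", description)

print(clean("Crisp & fruity, 90% Merlot!"))
```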
2
votes
2 answers

Is it possible to specify MLflow project Environment through a Dockerfile (instead of an image)?

To my understanding, as of May 2019 MLflow supports running a project in a Docker environment; however, it requires the Docker image to have already been built. This leaves building the Docker image as a separate workflow. What is the suggested way to run…
Bin
  • 3,645
  • 10
  • 33
  • 57
2
votes
1 answer

databricks-cli: JSONDecodeError when running job in bash script

I am trying to run a Databricks job with notebook parameters within a bash script on a Linux server. I am following the instructions from the docs and have verified that the commands work in the terminal. Here is my script: #!/bin/bash ### this commands…
Korean_Of_the_Mountain
  • 1,428
  • 3
  • 16
  • 40
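A common cause of `JSONDecodeError` here is the shell mangling the quotes in the `--notebook-params` JSON. One way to sidestep that is to build both the JSON and the shell quoting programmatically; the job ID and parameter names below are placeholders:

```python
import json
import shlex

# The CLI expects --notebook-params to be a single, valid JSON string.
# Building it with json.dumps and quoting it with shlex.quote prevents
# the shell from mangling the embedded quotes.
params = {"input_date": "2019-05-01", "env": "dev"}
cmd = "databricks jobs run-now --job-id 123 --notebook-params {}".format(
    shlex.quote(json.dumps(params))
)
print(cmd)
```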
2
votes
0 answers

DataFilters in Spark explain physical plans in Databricks but not on local machine

I am running the same query on the same dataset with the same Spark version (2.4.0) in two different environments - the explain plan includes DataFilters in the output in the Databricks environment, but doesn't include this in the output on my local…
Powers
  • 18,150
  • 10
  • 103
  • 108
2
votes
2 answers

Implementing K-medoids in Pyspark

I cannot find a library for PAM (k-medoids) in PySpark. I found this in Scala: https://gist.github.com/erikerlandson/c3c35f0b1aae737fc884 And this issue in Spark, which was resolved in 2016:…
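Since no PySpark PAM library is cited above, here is a small driver-side reference implementation of the PAM loop in plain Python (not a distributed Spark version); the point set and distance function are illustrative:

```python
import random

def pam(points, k, dist, iters=20, seed=0):
    """Tiny PAM (k-medoids) sketch: alternate assignment and medoid update.
    A driver-side reference for small data, not a Spark implementation."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[j].append(p)
        new_medoids = []
        for cluster, old in zip(clusters, medoids):
            if not cluster:
                new_medoids.append(old)
                continue
            # medoid = the member minimizing total distance within its cluster
            new_medoids.append(min(cluster, key=lambda c: sum(dist(c, q) for q in cluster)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
d = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(sorted(pam(pts, 2, d)))
```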
2
votes
1 answer

Write the results of the Google Api to a data lake with Databricks

I am getting back user usage data from the Google Admin Report User Usage API via the Python SDK on Databricks. The data size is around 100,000 records per day, which I fetch nightly via a batch process. The API returns a max page size of 1000, so I call…
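The pagination loop described above (max page size 1000, repeat until done) can be sketched generically: keep passing the returned `nextPageToken` back until the server omits it. `fetch_page` below is a stand-in for the real Reports API call, not the Google SDK:

```python
def fetch_all(fetch_page):
    """Drain a paginated API: keep requesting with the returned
    nextPageToken until the server stops sending one."""
    items, token = [], None
    while True:
        page = fetch_page(page_token=token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:
            return items

# Fake three-page API for demonstration.
pages = {None: {"items": [1, 2], "nextPageToken": "a"},
         "a": {"items": [3], "nextPageToken": "b"},
         "b": {"items": [4]}}
result = fetch_all(lambda page_token: pages[page_token])
print(result)
```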
2
votes
2 answers

python code to Unzip the zipped file in s3 server in databricks

This code unzips a zipped file stored on an S3 server. It runs on Databricks with Python 3 and pandas==0.19.0. zip_ref = zipfile.ZipFile(path, mode='r') The line above throws the error below. FileNotFoundError: [Errno 2] No such file or…
Sukanya
  • 33
  • 1
  • 1
  • 9
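The `FileNotFoundError` arises because `zipfile.ZipFile(path)` expects a local file path, and an S3 URL is not one. The usual fix is to read the object's bytes first (e.g. with `boto3`'s `get_object`) and wrap them in `io.BytesIO`; the sketch below simulates the S3 read with an in-memory zip:

```python
import io
import zipfile

# Build an in-memory zip to stand in for the S3 object's bytes
# (in real code: payload = s3.get_object(...)["Body"].read()).
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as zf:
    zf.writestr("data.csv", "a,b\n1,2\n")
payload = buf.getvalue()

# Wrap the raw bytes in BytesIO so zipfile can open them directly.
with zipfile.ZipFile(io.BytesIO(payload), mode="r") as zf:
    names = zf.namelist()
    text = zf.read("data.csv").decode()
print(names, text)
```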
2
votes
1 answer

How do I parallel write JSON files to a mounted directory using Spark in Databricks

I have an RDD of 50,000 JSON files that I need to write to a mounted directory in Spark (Databricks). The mounted path looks something like /mnt/myblob/mydata (using Azure). I tried the following, but it turns out that I can't use dbutils inside a…
Jane Wayne
  • 8,205
  • 17
  • 75
  • 120
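The underlying constraint is that `dbutils` is not available inside executor-side functions; plain `open()` on the local `/dbfs/mnt/...` mount path works instead, typically inside `foreachPartition`. A local sketch of the same write pattern using threads (the directory and record names are illustrative):

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# On Databricks this logic would live inside rdd.foreachPartition,
# writing through the /dbfs/mnt/... local path with ordinary file I/O.
out_dir = tempfile.mkdtemp()

def write_record(pair):
    name, record = pair
    path = os.path.join(out_dir, name + ".json")
    with open(path, "w") as f:
        json.dump(record, f)
    return path

records = [("r{}".format(i), {"id": i}) for i in range(10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(write_record, records))
print(len(paths))
```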
2
votes
1 answer

Using a JAR dependency in a PySpark parallelized execution context

This is for a PySpark / Databricks project: I've written a Scala JAR library and exposed its functions as UDFs via a simple Python wrapper; everything works as it should in my PySpark notebooks. However, when I try to use any of the functions…
dpq
  • 9,028
  • 10
  • 49
  • 69
2
votes
1 answer

PySpark - Getting BufferOverflowException while running dataframe.sql on CSV stored in S3

I get a BufferOverflowException when I run a Spark SQL query on a CSV stored in S3. Here is the link to the CSV and the data schema. I am actually using a GZIP-compressed CSV in S3. from pyspark.sql.types import * schema = StructType([…
c0degeas
  • 762
  • 9
  • 19
2
votes
3 answers

How to retrieve derived classes as is from a Map?

I have to retrieve derived-class objects stored in a Map, given the respective class name as the key. As shown below: trait Caluclator class PreScoreCalculator(data:Seq[Int]) extends Caluclator class BenchMarkCalculator(data:Seq[Int]) extends…
Shasu
  • 458
  • 5
  • 22
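One language-agnostic way to frame the question above is a class registry: store the classes in a map keyed by name, look one up, and instantiate it. A Python sketch of that pattern (the class names are translated from the question's Scala, so they are assumptions):

```python
# A class registry keyed by name: store the classes themselves in a
# dict, look one up by key, then instantiate with the data.
class Calculator:
    def __init__(self, data):
        self.data = data

class PreScoreCalculator(Calculator):
    pass

class BenchMarkCalculator(Calculator):
    pass

registry = {
    "PreScoreCalculator": PreScoreCalculator,
    "BenchMarkCalculator": BenchMarkCalculator,
}

calc = registry["PreScoreCalculator"]([1, 2, 3])
print(type(calc).__name__)
```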
2
votes
1 answer

Change temporary path for individual job from spark code

I have multiple jobs that I want to execute in parallel, appending daily data into the same path using dynamic partitioning. The problem I am facing is the temporary path that Spark creates during job execution. Multiple jobs end up…
techie
  • 313
  • 1
  • 8
  • 23
2
votes
1 answer

Spark getting current date in string

I have the code below to get the date in the proper format so it can be appended to a filename string. %scala // Getting the date for the file name import org.apache.spark.sql.functions.{current_timestamp, date_format} val dateFormat =…
Sauron
  • 6,399
  • 14
  • 71
  • 136
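The Scala snippet uses `date_format(current_timestamp(), ...)`; if the goal is only a filename-friendly date string, plain `datetime.strftime` gives the same result without touching Spark. In this sketch the Java-style pattern `yyyyMMdd` maps to `%Y%m%d` (the filename prefix is an assumption):

```python
from datetime import datetime, timezone

# Format today's UTC date as yyyyMMdd and embed it in a filename.
date_str = datetime.now(timezone.utc).strftime("%Y%m%d")
filename = "export_{}.csv".format(date_str)
print(filename)
```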