Questions tagged [databricks]

Databricks is a unified platform with tools for building, deploying, sharing, and maintaining enterprise-grade data and AI solutions at scale. The Databricks Lakehouse Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf. Databricks is available on AWS, Azure, and GCP. Use this tag for questions related to the Databricks Lakehouse Platform.

Use this tag for questions specific to the Databricks Lakehouse Platform, including, but not limited to, the Databricks File System (DBFS), REST APIs, Databricks Spark SQL extensions, and orchestration tools.

Don't use this tag for generic Apache Spark questions or for public Spark packages maintained by Databricks.


7135 questions
2
votes
1 answer

Error logging in python not working with azure databricks

A related question about this problem was never answered. I tried implementing error logging using Python in Azure Databricks. If I try the code below in Python (PyCharm), it works as expected, but when I try the same code in Azure…
Lucky
  • 21
  • 1
  • 3
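On managed runtimes such as Databricks, a frequent cause of this symptom is that the platform has already attached handlers to the root logger, so `logging.basicConfig` silently does nothing. A minimal sketch of a workaround using a dedicated named logger (the logger name is an assumption):

```python
import logging
import sys

# basicConfig() is a no-op if handlers are already attached to the root
# logger, which is common on managed platforms. Configuring a dedicated
# named logger with its own handler avoids that problem.
logger = logging.getLogger("my_job")          # hypothetical logger name
logger.setLevel(logging.ERROR)
logger.propagate = False                      # don't duplicate via the root logger

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

try:
    1 / 0
except ZeroDivisionError:
    logger.exception("division failed")       # logs the traceback at ERROR level
```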
2
votes
3 answers

Unable to Create Extract - Tableau and Spark SQL

I am trying to extract information from Spark SQL. The following error message appears while creating the extract. [Simba][Hardy] (35) Error from server: error code: '0' error message: 'org.apache.spark.SparkException: Job aborted due to stage…
Niks
  • 997
  • 2
  • 10
  • 28
2
votes
2 answers

Databricks not updating in SQL query

I am trying to replace special characters in a table column using a SQL query. However, I get the following error. Can anyone tell me what I did wrong or how I should approach this? SQL QUERY UPDATE wine SET description = REPLACE(description,…
ApplePie
  • 1,155
  • 3
  • 12
  • 30
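For context, `UPDATE` in Databricks SQL is supported only on Delta tables, and `REPLACE` substitutes one literal string at a time. A sketch of the same cleanup in plain Python with a regex; the character class is an assumption about which characters count as "special":

```python
import re

# Strip everything except letters, digits, and spaces. Adjust the
# character class to match whichever "special characters" you mean.
def clean(description: str) -> str:
    return re.sub(r"[^A-Za-z0-9 ]", "", description)

print(clean("Crisp & fruity, 90% Merlot!"))
```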
2
votes
2 answers

Is it possible to specify MLflow project Environment through a Dockerfile (instead of an image)?

To my understanding, as of May 2019 MLflow supports running a project in a Docker environment; however, it requires the Docker image to have already been built. This leaves building the Docker image as a separate workflow. What is the suggested way to run…
Bin
  • 3,645
  • 10
  • 33
  • 57
2
votes
1 answer

databricks-cli: JSONDecodeError when running job in bash script

I am trying to run a Databricks job with notebook parameters within a bash script on a Linux server. I am following the instructions from the docs and have verified that the commands work in the terminal. Here is my script: #!/bin/bash ### this commands…
Korean_Of_the_Mountain
  • 1,428
  • 3
  • 16
  • 40
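A common cause of `JSONDecodeError` here is the shell mangling the quotes in the `--notebook-params` JSON. One way to sidestep that is to build both the JSON and the shell quoting programmatically; the job ID and parameter names below are placeholders:

```python
import json
import shlex

# The CLI expects --notebook-params to be a single, valid JSON string.
# Building it with json.dumps and quoting it with shlex.quote prevents
# the shell from mangling the embedded quotes.
params = {"input_date": "2019-05-01", "env": "dev"}
cmd = "databricks jobs run-now --job-id 123 --notebook-params {}".format(
    shlex.quote(json.dumps(params))
)
print(cmd)
```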
2
votes
0 answers

DataFilters in Spark explain physical plans in Databricks but not on local machine

I am running the same query on the same dataset with the same Spark version (2.4.0) in two different environments - the explain plan includes DataFilters in the output in the Databricks environment, but doesn't include this in the output on my local…
Powers
  • 18,150
  • 10
  • 103
  • 108
2
votes
2 answers

Implementing K-medoids in Pyspark

I cannot find a library for PAM (k-medoids) in PySpark. I found this in Scala: https://gist.github.com/erikerlandson/c3c35f0b1aae737fc884 And this issue in Spark, which was resolved in 2016:…
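Since no PySpark PAM library is cited above, here is a small driver-side reference implementation of the PAM loop in plain Python (not a distributed Spark version); the point set and distance function are illustrative:

```python
import random

def pam(points, k, dist, iters=20, seed=0):
    """Tiny PAM (k-medoids) sketch: alternate assignment and medoid update.
    A driver-side reference for small data, not a Spark implementation."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[j].append(p)
        new_medoids = []
        for cluster, old in zip(clusters, medoids):
            if not cluster:
                new_medoids.append(old)
                continue
            # medoid = the member minimizing total distance within its cluster
            new_medoids.append(min(cluster, key=lambda c: sum(dist(c, q) for q in cluster)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
d = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(sorted(pam(pts, 2, d)))
```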
2
votes
1 answer

Write the results of the Google Api to a data lake with Databricks

I am getting back user usage data from the Google Admin Report User Usage API via the Python SDK on Databricks. The data size is around 100,000 records per day, which I fetch nightly via a batch process. The API returns a max page size of 1000, so I call…
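The pagination loop described above (max page size 1000, repeat until done) can be sketched generically: keep passing the returned `nextPageToken` back until the server omits it. `fetch_page` below is a stand-in for the real Reports API call, not the Google SDK:

```python
def fetch_all(fetch_page):
    """Drain a paginated API: keep requesting with the returned
    nextPageToken until the server stops sending one."""
    items, token = [], None
    while True:
        page = fetch_page(page_token=token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:
            return items

# Fake three-page API for demonstration.
pages = {None: {"items": [1, 2], "nextPageToken": "a"},
         "a": {"items": [3], "nextPageToken": "b"},
         "b": {"items": [4]}}
result = fetch_all(lambda page_token: pages[page_token])
print(result)
```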
2
votes
2 answers

python code to Unzip the zipped file in s3 server in databricks

This code unzips a zipped file stored on an S3 server. It runs on Databricks with Python 3 and pandas==0.19.0. zip_ref = zipfile.ZipFile(path, mode='r') The line above throws the error below. FileNotFoundError: [Errno 2] No such file or…
Sukanya
  • 33
  • 1
  • 1
  • 9
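The `FileNotFoundError` arises because `zipfile.ZipFile(path)` expects a local file path, and an S3 URL is not one. The usual fix is to read the object's bytes first (e.g. with `boto3`'s `get_object`) and wrap them in `io.BytesIO`; the sketch below simulates the S3 read with an in-memory zip:

```python
import io
import zipfile

# Build an in-memory zip to stand in for the S3 object's bytes
# (in real code: payload = s3.get_object(...)["Body"].read()).
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as zf:
    zf.writestr("data.csv", "a,b\n1,2\n")
payload = buf.getvalue()

# Wrap the raw bytes in BytesIO so zipfile can open them directly.
with zipfile.ZipFile(io.BytesIO(payload), mode="r") as zf:
    names = zf.namelist()
    text = zf.read("data.csv").decode()
print(names, text)
```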
2
votes
1 answer

How do I parallel write JSON files to a mounted directory using Spark in Databricks

I have an RDD of 50,000 JSON files that I need to write to a mounted directory in Spark (Databricks). The mounted path looks something like /mnt/myblob/mydata (using Azure). I tried the following, but it turns out that I can't use dbutils inside a…
Jane Wayne
  • 8,205
  • 17
  • 75
  • 120
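The underlying constraint is that `dbutils` is not available inside executor-side functions; plain `open()` on the local `/dbfs/mnt/...` mount path works instead, typically inside `foreachPartition`. A local sketch of the same write pattern using threads (the directory and record names are illustrative):

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# On Databricks this logic would live inside rdd.foreachPartition,
# writing through the /dbfs/mnt/... local path with ordinary file I/O.
out_dir = tempfile.mkdtemp()

def write_record(pair):
    name, record = pair
    path = os.path.join(out_dir, name + ".json")
    with open(path, "w") as f:
        json.dump(record, f)
    return path

records = [("r{}".format(i), {"id": i}) for i in range(10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(write_record, records))
print(len(paths))
```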
2
votes
1 answer

Using a JAR dependency in a PySpark parallelized execution context

This is for a PySpark / Databricks project: I've written a Scala JAR library and exposed its functions as UDFs via a simple Python wrapper; everything works as it should in my PySpark notebooks. However, when I try to use any of the functions…
dpq
  • 9,028
  • 10
  • 49
  • 69
2
votes
1 answer

PySpark - Getting BufferOverflowException while running dataframe.sql on CSV stored in S3

I get a BufferOverflowException when I run a Spark SQL query on a CSV stored in S3. Here is the link to the CSV and the data schema. I am actually using a GZIP-compressed CSV in S3. from pyspark.sql.types import * schema = StructType([…
c0degeas
  • 762
  • 9
  • 19
2
votes
3 answers

How to retrieve derived classes as is from a Map?

I have to retrieve derived-class objects stored in a Map, given the respective class name as the key. As shown below: trait Caluclator class PreScoreCalculator(data:Seq[Int]) extends Caluclator class BenchMarkCalculator(data:Seq[Int]) extends…
Shasu
  • 458
  • 5
  • 22
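One language-agnostic way to frame the question above is a class registry: store the classes in a map keyed by name, look one up, and instantiate it. A Python sketch of that pattern (the class names are translated from the question's Scala, so they are assumptions):

```python
# A class registry keyed by name: store the classes themselves in a
# dict, look one up by key, then instantiate with the data.
class Calculator:
    def __init__(self, data):
        self.data = data

class PreScoreCalculator(Calculator):
    pass

class BenchMarkCalculator(Calculator):
    pass

registry = {
    "PreScoreCalculator": PreScoreCalculator,
    "BenchMarkCalculator": BenchMarkCalculator,
}

calc = registry["PreScoreCalculator"]([1, 2, 3])
print(type(calc).__name__)
```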
2
votes
1 answer

Change temporary path for individual job from spark code

I have multiple jobs that I want to execute in parallel, appending daily data into the same path using dynamic partitioning. The problem I am facing is the temporary path that Spark creates during job execution. Multiple jobs end up…
techie
  • 313
  • 1
  • 8
  • 23
2
votes
1 answer

Spark getting current date in string

I have the code below to get the date in the proper format so it can be appended to a filename string. %scala // Getting the date for the file name import org.apache.spark.sql.functions.{current_timestamp, date_format} val dateFormat =…
Sauron
  • 6,399
  • 14
  • 71
  • 136
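The Scala snippet uses `date_format(current_timestamp(), ...)`; if the goal is only a filename-friendly date string, plain `datetime.strftime` gives the same result without touching Spark. In this sketch the Java-style pattern `yyyyMMdd` maps to `%Y%m%d` (the filename prefix is an assumption):

```python
from datetime import datetime, timezone

# Format today's UTC date as yyyyMMdd and embed it in a filename.
date_str = datetime.now(timezone.utc).strftime("%Y%m%d")
filename = "export_{}.csv".format(date_str)
print(filename)
```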