Questions tagged [aws-databricks]

For questions about the usage of the Databricks Lakehouse Platform on the AWS cloud.

Databricks Lakehouse Platform on AWS

The Databricks Lakehouse Platform accelerates innovation across data science, data engineering, business analytics, and data warehousing, integrated with your AWS infrastructure.

Reference: https://databricks.com/aws

190 questions
2
votes
1 answer

Specify a database name in Databricks SQL connection parameters

I am using Airflow 2.0.2 to connect with Databricks using the airflow-databricks-operator. The SQL operator doesn't let me specify the database where the query should be executed, so I have to prefix the table_name with database_name. I tried…
Pbd
  • 1,219
  • 1
  • 15
  • 32
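A minimal sketch of one workaround, assuming the databricks-sql-connector and a placeholder warehouse endpoint: issue a USE statement on the connection first so unqualified table names resolve against the desired database.

```python
# Sketch: work around the missing database parameter by issuing USE first.
# Hostname, HTTP path, and token below are placeholders.
from databricks import sql

conn = sql.connect(
    server_hostname="<workspace-host>",
    http_path="<warehouse-http-path>",
    access_token="<token>",
)
cur = conn.cursor()
cur.execute("USE my_database")           # set the default database
cur.execute("SELECT * FROM my_table")    # resolves to my_database.my_table
rows = cur.fetchall()
cur.close()
conn.close()
```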
2
votes
0 answers

How to set a custom path for Databricks MLflow artifacts on S3

I've created an empty experiment from the Databricks experiments console and given the path for my artifacts on S3, i.e. s3:///. When I run the scripts, the artifacts are stored at s3:////<32 char id>/artifacts/model-Elasticnet/model.pkl. I want…
shahidammer
  • 1,026
  • 2
  • 10
  • 24
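A hedged sketch of how the artifact root is set, using the MLflow Python API: artifact_location can only be chosen when the experiment is created, and MLflow always appends <run_id>/artifacts/... beneath that root; the suffix itself is not configurable. Names below are placeholders.

```python
# Sketch: set a custom S3 artifact root at experiment creation.
# MLflow still appends <run_id>/artifacts/<model-name>/ under this root.
import mlflow

experiment_id = mlflow.create_experiment(
    name="/Users/me@example.com/my-experiment",        # hypothetical workspace path
    artifact_location="s3://my-bucket/custom/prefix",  # custom artifact root
)

with mlflow.start_run(experiment_id=experiment_id):
    mlflow.log_param("alpha", 0.5)
```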
2
votes
1 answer

Data Lakes - S3 and Databricks

I understand Data Lake Zones in S3 and I am looking at establishing three zones: LANDING, STAGING, CURATED. If I were in an Azure environment, I would create the Data Lake and have multiple folders as the various zones. How would I do the equivalent in AWS…
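As a hedged illustration (the bucket name is a placeholder): S3 has no real directories, so the Azure "folders as zones" pattern translates to key prefixes, which can be pre-created from a Databricks notebook with dbutils.

```python
# Sketch: S3 "zones" are just key prefixes under one bucket.
# dbutils is available in Databricks notebooks.
bucket = "s3a://my-data-lake-bucket"   # placeholder bucket

for zone in ["landing", "staging", "curated"]:
    dbutils.fs.mkdirs(f"{bucket}/{zone}/")   # create the prefix marker
```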
2
votes
1 answer

Why would the dataframe.write.mode("overwrite").saveAsTable("table") command be dropping data?

%python
dataframe.count()  # output 1179

%python
dataframe.write.mode("overwrite").saveAsTable("tablename")

%sql
select count(*) from tablename  -- output 1069

What can I be doing wrong? (These are different cells in Databricks.) I want to…
proutray
  • 1,943
  • 3
  • 30
  • 48
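One common cause is lazy re-evaluation: if the dataframe's source is non-deterministic or changes between actions, count() and saveAsTable() can see different data. A hedged sketch that pins the data with cache() before both actions:

```python
# Sketch: pin the dataframe so count() and saveAsTable() see the same data.
# "dataframe" is the dataframe from the question.
dataframe = dataframe.cache()
print(dataframe.count())   # materializes the cache, e.g. 1179

dataframe.write.mode("overwrite").saveAsTable("tablename")
spark.sql("SELECT COUNT(*) FROM tablename").show()   # should now match
```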
2
votes
1 answer

What is the expected input date pattern for the date_format function in Databricks Spark SQL

I am trying to better understand the date_format function offered by Spark SQL. As per the official Databricks documentation (I am using Databricks), this function expects any date/string in a valid datetime format. Below is the link for the…
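A small sketch of the function's contract (example values are illustrative): the first argument must be a date, a timestamp, or a string Spark can cast to one (e.g. 'yyyy-MM-dd'), and the second is the output pattern.

```python
# Sketch: date_format takes a date/timestamp-castable input plus an
# output pattern, and returns a formatted string column.
from pyspark.sql import functions as F

df = spark.createDataFrame([("2021-03-15",)], ["d"])
df.select(
    F.date_format(F.col("d"), "dd/MM/yyyy").alias("formatted"),  # 15/03/2021
    F.date_format(F.col("d"), "E").alias("weekday"),             # Mon
).show()
```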
2
votes
0 answers

Databricks Spark throws java.io.NotSerializableException: com.amazonaws.services.s3.AmazonS3Client

Hi, I am trying to run the following code on Databricks, which is a 3-node Spark cluster. I retrieve the data from a Kinesis stream into a Spark dataframe and transform it to extract the payload JSON file name. In the code below, I am trying to download…
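The usual cause is that the S3 client is built on the driver and captured by a closure that Spark then tries to serialize. A hedged sketch of the standard fix, constructing the client inside the partition function (bucket and column names are hypothetical):

```python
# Sketch: build the S3 client on the executor, inside the partition
# function, so it is never serialized into the task closure.
import boto3

def download_payloads(rows):
    s3 = boto3.client("s3")               # constructed per partition, on the executor
    for row in rows:
        s3.download_file(
            "my-bucket",                  # placeholder bucket
            row.payload_key,              # hypothetical column holding the S3 key
            f"/tmp/{row.payload_key.split('/')[-1]}",
        )

# df: the dataframe built from the Kinesis stream in the question
df.foreachPartition(download_payloads)
```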
2
votes
1 answer

Not able to display charts in Databricks when using a loop (not at end of cell)

I'm using a Databricks notebook. For various reasons, I need to render charts individually (concat doesn't give me the results I want) and I can't put the chart object at the end of the cell. I want to render each chart and do some processing…
Mike Woodward
  • 211
  • 2
  • 10
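A hedged sketch, assuming Altair (the mention of concat suggests it): render each chart explicitly with displayHTML mid-cell instead of relying on the implicit display of the cell's last expression.

```python
# Sketch: render each Altair chart mid-cell via displayHTML,
# which is a Databricks notebook builtin.
import altair as alt
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b", "b"],
                   "x": [1, 2, 1, 2], "y": [3, 1, 4, 2]})

for group, frame in df.groupby("category"):
    chart = alt.Chart(frame).mark_line().encode(x="x", y="y")
    displayHTML(chart.to_html())      # renders here, not at end of cell
    # ...per-chart processing can happen here...
```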
2
votes
1 answer

Installing C libraries needed for R spatial packages on Databricks clusters

Spatial packages in R often depend on C libraries for their numerical computation. This presents a problem when installing such R packages if the R engine is unable to install those libraries using default permissions. It…
Cyrus Mohammadian
  • 4,982
  • 6
  • 33
  • 62
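A hedged sketch of the usual approach: a cluster-scoped init script that apt-installs the system libraries before the R packages compile. The library names are the common dependencies of sf/rgdal, and the DBFS path is an assumption; attach the script to the cluster under its init-script settings.

```python
# Sketch: write an init script that installs the C libraries R spatial
# packages compile against. Path and library list are assumptions.
script = """#!/bin/bash
apt-get update
apt-get install -y libgdal-dev libgeos-dev libproj-dev libudunits2-dev
"""

dbutils.fs.put("dbfs:/databricks/init-scripts/install-geo-libs.sh",
               script, overwrite=True)
```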
2
votes
2 answers

Speeding up writing a heavily partitioned dataframe to S3 on Databricks

I'm running a notebook on Databricks which creates partitioned PySpark dataframes and uploads them to S3. The table in question has ~5,000 files and is ~5 GB in total size (it needs to be partitioned in this way to be effectively queried by…
fez
  • 1,726
  • 3
  • 21
  • 31
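Many small files often come from every task writing a file into every partition directory. A hedged sketch (column and path are placeholders) that repartitions on the partition column first, so each partition directory is written by a single task:

```python
# Sketch: one task per partition directory instead of one small file per
# task per partition. "df" is the dataframe from the question.
(df.repartition("event_date")              # hypothetical partition column
   .write
   .mode("overwrite")
   .partitionBy("event_date")
   .parquet("s3a://my-bucket/my_table/"))  # placeholder path
```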
1
vote
1 answer

Move managed DLT table from one schema to another schema in Databricks

I have a DLT table in schema A which is being loaded by a DLT pipeline. I want to move the table from schema A to schema B and repoint my existing DLT pipeline to the table in schema B. I also need to avoid a full reload in the DLT pipeline on the table in schema…
Athi
  • 347
  • 4
  • 12
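A heavily hedged sketch of one option, assuming Delta tables: DEEP CLONE the table into schema B and then repoint the pipeline. Whether DLT then avoids a full reload is not guaranteed and should be validated on a non-production pipeline first.

```python
# Heavily hedged: DEEP CLONE copies the table's data and metadata into
# schema B; schema and table names are placeholders.
spark.sql("CREATE TABLE schema_b.my_table DEEP CLONE schema_a.my_table")
```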
1
vote
0 answers

Using private Python packages with Databricks model serving

I am attempting to host a Python MLflow model using Databricks model serving. While the serving endpoint functions correctly without private Python packages, I am encountering difficulties when attempting to include them. Context: Without Private…
Eric
  • 795
  • 5
  • 21
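One hedged pattern: log the model with the private wheel listed in pip_requirements so the serving container installs it when the endpoint image is built. The wheel path and model wrapper below are hypothetical.

```python
# Sketch: include a private wheel in the model's pip requirements.
# Wheel path and MyModel are hypothetical.
import mlflow
import mlflow.pyfunc

class MyModel(mlflow.pyfunc.PythonModel):   # minimal pyfunc wrapper
    def predict(self, context, model_input):
        return model_input                  # placeholder logic

mlflow.pyfunc.log_model(
    artifact_path="model",
    python_model=MyModel(),
    pip_requirements=[
        "/dbfs/packages/my_private_pkg-0.1-py3-none-any.whl",  # hypothetical wheel
        "pandas",
    ],
)
```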
1
vote
2 answers

How to dynamically change variables in a Databricks notebook based on which environment it was deployed to?

I want to move data from an S3 bucket to Databricks. On both platforms I have separate environments for DEV, QA, and PROD. I use a Databricks notebook which I deploy to Databricks using Terraform. Within the notebook there are some hardcoded variables,…
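A hedged sketch of one common pattern: have the deployment pass the environment name as a notebook widget and resolve per-environment settings from a mapping (names below are hypothetical).

```python
# Sketch: read the environment name from a widget, which the deployment
# can set as a notebook parameter; fall back to "dev" interactively.
dbutils.widgets.text("env", "dev")
env = dbutils.widgets.get("env")

CONFIG = {                                  # hypothetical per-env settings
    "dev":  {"bucket": "s3a://my-data-dev"},
    "qa":   {"bucket": "s3a://my-data-qa"},
    "prod": {"bucket": "s3a://my-data-prod"},
}

bucket = CONFIG[env]["bucket"]
```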
1
vote
2 answers

pyspark filtering column values using endswith

Hi, I'm trying to filter some values of a column in a table using the function "endswith". The table looks like…
MMV
  • 164
  • 10
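For reference, a minimal sketch of Column.endswith, which returns a boolean column usable directly in filter() (the column name is hypothetical):

```python
# Sketch: filter rows whose column value ends with a given suffix.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("a.json",), ("b.csv",), ("c.json",)], ["filename"])

df.filter(F.col("filename").endswith(".json")).show()   # keep matches
df.filter(~F.col("filename").endswith(".json")).show()  # drop matches
```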
1
vote
1 answer

Error: cannot read mws workspaces: RESOURCE_DOES_NOT_EXIST: workspace 96783599 does not exist

When I run terraform apply, my workspace gets created but I get the following error. I have looked for "workspace 96783599" but was unable to find any resource with that number. Error: cannot read mws workspaces:…
1
vote
1 answer

Show table with multiple conditions in Databricks

I want to find tables in my Databricks database that meet more than one condition. MySQL allows 'where' clauses to include multiple conditions, as this post explains. To use multiple conditions in Databricks, I can use the following syntax, but…
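A hedged sketch of one approach: since SHOW TABLES returns a dataframe, arbitrary combinations of conditions can be ordinary filter() expressions instead of a single LIKE pattern (the database name is a placeholder).

```python
# Sketch: filter the SHOW TABLES result with multiple conditions.
from pyspark.sql import functions as F

tables = spark.sql("SHOW TABLES IN my_database")   # columns include tableName
tables.filter(
    F.col("tableName").startswith("fact_")
    & F.col("tableName").endswith("_v2")
).show()
```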