Questions tagged [aws-databricks]

For questions about the usage of Databricks Lakehouse Platform on AWS cloud.

Databricks Lakehouse Platform on AWS

The Databricks Lakehouse Platform accelerates innovation across data science, data engineering, business analytics, and data warehousing, integrated with your AWS infrastructure.

Reference: https://databricks.com/aws

190 questions
3 votes · 1 answer

Load files in order with Databricks autoloader

I'm trying to write a Python pipeline in Databricks to take CDC data from a Postgres database, dumped by DMS into S3 as Parquet files, and ingest it. The file names are numerically ascending unique IDs based on datetime (i.e. 20220630-215325970.csv). Right now…
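The approach hinted at in this question relies on the file-name scheme itself: zero-padded datetime-based names sort lexicographically in the same order as chronologically, so sorting the listing recovers arrival order. A minimal pure-Python sketch (the file names below are made up to match the pattern in the question):

```python
# Zero-padded datetime-stamped names (YYYYMMDD-HHMMSSmmm) sort
# lexicographically in the same order as chronologically, so a plain
# sort on the listing recovers arrival order.
names = [
    "20220630-215325970.csv",
    "20220629-090011123.csv",
    "20220630-081500001.csv",
]

ordered = sorted(names)  # lexicographic sort == chronological order here
print(ordered)           # oldest file first
```

Note that Auto Loader itself does not guarantee processing files in arrival order, so pipelines that need ordering typically carry the file name into the data (e.g. via Spark's `input_file_name()`) and sequence downstream.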
3 votes · 1 answer

Azure Databricks Architecture - Communication between Control plane and data plane and authentications

I am trying to understand the Azure Databricks architecture based on this link. I understand the purpose of the control plane and the data plane in the Azure Databricks architecture, but I couldn't understand the following questions. How…
3 votes · 1 answer

StreamingQuery Delta Tables within Databricks - Describe History

I have a Delta table which I am reading as a StreamingQuery. Looking through the Delta table history, using DESCRIBE HISTORY, I am seeing that 99% of the operationMetrics state that numTargetRowsUpdates is 0, with most operations being inserts.…
3 votes · 2 answers

Databricks Notebook 8.3 (Apache Spark 3.1.1, Scala 2.12) | pyspark | Parquet write exception | Multiple failures in stage materialization

This is production code that was running fine until last week. Then this Parquet write error showed up and has never been resolved. While writing to AWS S3 in Parquet format, I tried several dataframe.repartition values - 300, 500, 2400, 6000. But no luck.…
3 votes · 1 answer

How to configure a custom Spark Plugin in Databricks?

How do I properly configure a Spark plugin and the JAR containing the Spark Plugin class in Databricks? I created the following Spark 3 plugin class in Scala, CustomExecSparkPlugin.scala: package example import org.apache.spark.api.plugin.{SparkPlugin,…
3 votes · 3 answers

AWS S3 to Databricks mount is not working

I have mounted 'mybucket' using mount commands, and I am able to list all the objects using the command %fs ls /mnt/mybucket/. However, I have folders nested inside folders in 'mybucket', and I want to run the command below, but it is not…
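In a notebook, `%fs ls` lists only a single level, so enumerating objects inside nested folders generally means recursing over directory entries. A pure-Python analogue of that recursion (dbutils is not available outside Databricks, so this sketch uses the local filesystem):

```python
import os

def list_recursive(path):
    """Recursively collect all file paths under `path`,
    mirroring what a recursive dbutils.fs.ls walk would do."""
    files = []
    for entry in os.scandir(path):
        if entry.is_dir():
            files.extend(list_recursive(entry.path))
        else:
            files.append(entry.path)
    return files
```

On Databricks, the same pattern applies with `dbutils.fs.ls(path)` and each entry's `isDir()` flag in place of `os.scandir`.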
3 votes · 0 answers

Databricks: Difference between dbfs:/ vs file:/

I am trying to understand the way Databricks stores files, and I am a bit unsure of the difference between dbfs:/ and file:/ (see image below). From what I have been able to deduce from here, file:/ seems to be the area where external files…
Neal
3 votes · 3 answers

Can't Access /dbfs/FileStore using shell commands in databricks runtime version 7

In Databricks Runtime version 6.6 I am able to successfully run a shell command like the following: %sh ls /dbfs/FileStore/tables However, in Runtime version 7, this no longer works. Is there any way to directly access /dbfs/FileStore in runtime…
2 votes · 1 answer

Can we execute a single task in isolation from a multi-task Databricks job

Can we execute a single task in isolation from a multi-task Databricks job?
soumya-kole
2 votes · 1 answer

Cross Job Dependencies in Databricks Workflow

I am trying to create a data pipeline in Databricks using the Workflows UI. I have a significant number of tasks, which I want to split across multiple jobs with dependencies defined across them. But it seems like in Databricks there cannot be cross…
Abhishek
2 votes · 1 answer

Using code_path in mlflow.pyfunc models on Databricks

We are using Databricks over AWS infra, registering models on MLflow. We write our in-project imports as from src.(module location) import (objects). Following examples online, I expected that when I use mlflow.pyfunc.log_model(...,…
perfects
2 votes · 1 answer

Databricks: how to exit the entire 'job' in the notebook orchestration workflow?

Say I have a simple notebook orchestration: Notebook A -> Notebook B. Notebook A finishes first, then triggers Notebook B. I am wondering if there is an out-of-the-box method to allow Notebook A to terminate the entire job (without running Notebook…
QPeiran
2 votes · 1 answer

Where does the Databricks cluster run when I create a cluster through the UI in Databricks?

I am new to Databricks, and I am confused after creating a cluster. Databricks asked me to connect an AWS account before creating a workspace, and I did. Then I created a cluster. Now I want to know where the cluster runs. Is…
2 votes · 3 answers

AWS Databricks pricing - should we also pay for EC2 instances separately, in addition to DBU costs?

I am trying to do some cost comparison between AWS Glue and Databricks hosted in an AWS environment. For the comparison, I have chosen m4.xlarge, which is the equivalent of 1 DPU in AWS Glue (4 vCPUs/16 GB memory). Assuming I have a PySpark job that's…
Yuva
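On AWS, Databricks billing is indeed two-part: the EC2 instance cost is paid to AWS and the DBU cost to Databricks, so a comparison has to sum both. A back-of-the-envelope sketch, with illustrative rates that are assumptions (check the current AWS EC2 and Databricks pricing pages for real numbers):

```python
# Hypothetical hourly rates for illustration only -- not current prices.
ec2_rate = 0.20       # $/hour per m4.xlarge, paid to AWS
dbu_per_hour = 0.75   # DBUs consumed per m4.xlarge node-hour (assumed)
dbu_price = 0.15      # $/DBU for the chosen Databricks tier/workload (assumed)

def hourly_cost(nodes):
    """Total $/hour for a cluster of `nodes` instances:
    EC2 cost (to AWS) + DBU cost (to Databricks)."""
    return nodes * (ec2_rate + dbu_per_hour * dbu_price)

print(round(hourly_cost(4), 4))
```

The same structure works for the Glue side of the comparison by replacing the two-part rate with a single $/DPU-hour figure.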
2 votes · 2 answers

AWS Databricks cluster start failure

I am currently unable to spin up any clusters in our Databricks AWS environment. When I attempt to start an on-demand cluster, it remains in "pending" for 20+ minutes (on relatively small clusters which usually take 2-3 minutes to start…
wylie