Questions tagged [databricks]

Databricks is a unified platform with tools for building, deploying, sharing, and maintaining enterprise-grade data and AI solutions at scale. The Databricks Lakehouse Platform integrates with cloud storage and security in your cloud account, and manages and deploys cloud infrastructure on your behalf. Databricks is available on AWS, Azure, and GCP. Use this tag for questions related to the Databricks Lakehouse Platform.

Use this tag for questions specific to the Databricks Lakehouse Platform, including, but not limited to, the Databricks File System (DBFS), REST APIs, Databricks Spark SQL extensions, and orchestration tools.

Don't use this tag for generic questions about Apache Spark or for public Spark packages maintained by Databricks.


7135 questions
2
votes
0 answers

Tensorflow estimator unknown Input/output error on Azure Databricks

I am trying to run the official BERT pretraining scripts based on this tutorial https://towardsdatascience.com/pre-training-bert-from-scratch-with-cloud-tpu-6e2f71028379, with the main difference that I am using Azure Databricks. When I try to…
Usherwood
  • 359
  • 3
  • 11
2
votes
1 answer

Azure- Column has a data type that cannot participate in a columnstore index

I'm trying to load data into a table in an Azure database using Databricks, and I get the following error: com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'MemberNumber' has a data type that cannot participate in a columnstore…
VinaySavanth
  • 21
  • 1
  • 2
2
votes
1 answer

What is the reason for inconsistent counts in Pyspark, Spark SQL and toPandas().shape?

I am working on Databricks Cloud 5.4 ML and created a training dataset for my classification problem. When counting the records I get inconsistencies in the counts that I cannot explain. Furthermore, I have checked that my Spark DataFrame does not…
2
votes
1 answer

Is there any method in dbutils to check the existence of a file, something like dbutils.fs.exists?

I wish to check whether a file at a certain location, say /dbfs/FileStore/tables/xyz.json, exists or not. If yes, the method should return true. I checked the methods in dbutils but can't seem to find any. Also, I cannot mount any location in ADLS. What are the…
Pankaj Mishra
  • 151
  • 2
  • 10
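dbutils has no built-in exists method; a common workaround, sketched below under the assumption that the cluster exposes DBFS through the /dbfs FUSE mount, is to use an ordinary Python file check (an alternative is to call dbutils.fs.ls on the path and treat the resulting exception as "not found"):

```python
import os

def dbfs_file_exists(path: str) -> bool:
    """Check whether a path exists.

    On a Databricks cluster, dbfs:/ paths are also exposed through the
    /dbfs FUSE mount, so ordinary Python file checks work for them,
    e.g. dbfs_file_exists("/dbfs/FileStore/tables/xyz.json").
    """
    return os.path.exists(path)
```

This sidesteps dbutils entirely, so it also works in plain Python code that has no dbutils handle.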
2
votes
2 answers

AttributeError: module 'gensim.utils' has no attribute 'smart_open'

I am building the vocabulary table using Doc2vec, but there is an error: "AttributeError: module 'gensim.utils' has no attribute 'smart_open'". How do I solve this? This is for a notebook on the Databricks platform, running Python 3. In the past, I've…
2
votes
0 answers

Databricks error when copying and reading a file to DBFS that is > 2 GB

I have a CSV of size 6 GB. So far I was using the following line; when I check the file's size on DBFS after this copy using Java I/O, it still shows as 6 GB, so I assume it was right. But when I do a spark.read.csv(samplePath) it reads only 18mn rows…
user3868051
  • 1,147
  • 2
  • 22
  • 43
2
votes
1 answer

How to use MLfLow with private git repositories?

I tested an MLflow experiment where the source code is stored in a public git repository. An example command looks like this: mlflow run https://github.com/amesar/mlflow-fun.git#examples/hello_world \ --experiment-id=2019 \ -Palpha=100…
Prince Bhatti
  • 4,671
  • 4
  • 18
  • 24
2
votes
1 answer

Using service principal to access blob storage from Databricks

I followed "Access an Azure Data Lake Storage Gen2 account directly with OAuth 2.0 using the Service Principal" and want to achieve the same with Blob Storage (general purpose v2, with hierarchical namespace disabled). Is it possible to get this working,…
czajek
  • 714
  • 1
  • 9
  • 23
2
votes
1 answer

Databricks, AzureCredentialNotFoundException

I have a High Concurrency cluster with Active Directory integration turned on. Runtime: Latest stable (Scala 2.11), Python: 3. I've mounted Azure Data Lake, and when I want to read the data, always the first time after cluster start I…
Tomek
  • 41
  • 4
2
votes
1 answer

Execute a Databricks Notebook with PySpark code using Apache Airflow

I'm using Airflow, Databricks, and PySpark. I would like to know whether it is possible to pass more parameters when executing a Databricks Notebook through Airflow. I had the following code in Python, named MyETL: def main(**kwargs): …
Eric Bellet
  • 1,732
  • 5
  • 22
  • 40
2
votes
1 answer

Spark 2.4 CSV load issue with option "nullValue"

We were using Spark 2.3 before; now we're on 2.4: Spark version 2.4.0, Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212). We had a piece of code running in production that converted CSV files to Parquet format. One of the options…
KK2486
  • 353
  • 2
  • 3
  • 13
2
votes
2 answers

Can't connect to Azure Data Lake Gen2 using PySpark and Databricks Connect

Recently, Databricks launched Databricks Connect, which allows you to write jobs using native Spark APIs and have them execute remotely on an Azure Databricks cluster instead of in the local Spark session. It works fine except when I try to access…
flappy
  • 173
  • 1
  • 4
  • 12
2
votes
1 answer

How to rename, or even access a column with spaces in its name?

A table with 'Person Rank' as one of its column names is uploaded to Azure and then accessed via a Databricks notebook. Writing SQL statements against this column gives errors, and even renaming it is a problem. All the following commands give…
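For context, the two standard remedies (a sketch, not the asker's exact code) are to quote the name with backticks in Spark SQL, or to rename the column via the DataFrame API with df.withColumnRenamed("Person Rank", "person_rank"). A small pure-Python helper for producing the backtick-quoted form:

```python
def quote_spark_identifier(name: str) -> str:
    """Wrap a column name in backticks for use in Spark SQL, doubling any
    embedded backticks, so that names containing spaces can be referenced:
    SELECT `Person Rank` FROM my_table
    """
    return "`" + name.replace("`", "``") + "`"

print(quote_spark_identifier("Person Rank"))  # → `Person Rank`
```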
2
votes
1 answer

Why do I get "Task not serializable" when running Scala code in Spark?

I'm trying to learn Spark through the e-learning course "Apache Spark with Scala" by Frank Kane. I use Databricks to run the code, and when I run it I get "org.apache.spark.SparkException: Task not serializable". The code is given below (link to csv…
Beckenbaur93
  • 113
  • 9
2
votes
2 answers

How to send a list as parameter in databricks notebook task?

I am using the Databricks REST API to create a job with a notebook_task on an existing cluster, getting the job_id in return. Then I call the run-now API to trigger the job. In this step, I want to send a list as an argument via the…
Rony
  • 196
  • 2
  • 15
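A common answer pattern here (a minimal sketch, not necessarily the accepted one, with a hypothetical job id and parameter name): notebook_params values in the Jobs run-now API are strings, so a list can be JSON-encoded by the caller and decoded back inside the notebook, e.g. from the string returned by dbutils.widgets.get:

```python
import json

# Caller side: notebook_params values must be strings in the Jobs
# run-now REST API, so serialize the list to JSON first.
items = ["2019-01-01", "2019-01-02", "2019-01-03"]
run_now_body = {
    "job_id": 42,  # hypothetical job id
    "notebook_params": {"dates": json.dumps(items)},
}

# Notebook side: the value arrives as a plain string (e.g. via
# dbutils.widgets.get("dates")); decode it back into a list.
received = run_now_body["notebook_params"]["dates"]
assert json.loads(received) == items
```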