
I am fairly new to Databricks, so forgive my lack of knowledge here. I am using the Databricks resource in Azure. I mainly use the UI right now, but I know some features are only available through the databricks-cli, which I have set up but not used yet.

I have cloned my Git repo into Databricks Repos using the UI. Inside my repo, there is a Python file that I would like to run as a job.

Can I use Databricks Jobs to create a job that calls this Python file directly? The only way I have been able to make this work is to create another Python file that calls the file in my Databricks Repo and upload it to DBFS.

Maybe it cannot be done, or maybe the path I am using is incorrect. I tried the following path structure when creating a job from a Python file, but unfortunately it did not work:

file:/Workspace/Repos/<user_folder>/<repo_name>/my_python_file.py

3 Answers


One workaround is to create a wrapper notebook that calls this file, i.e.

from my_python_file import main
main()

Then you can schedule a job on this notebook.
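If the wrapper notebook and the Python file are not in the same folder (see the comments below), the import can fail with "No module named 'my_python_file'". A minimal sketch of a workaround, assuming the Repos path from the question (replace <user_folder> and <repo_name> with your own values):

import sys

# Path to the cloned repo in the workspace; the placeholders come from the question.
repo_root = "/Workspace/Repos/<user_folder>/<repo_name>"

# Make the repo folder importable from a wrapper notebook located elsewhere.
if repo_root not in sys.path:
    sys.path.append(repo_root)

from my_python_file import main
main()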

Zi Dong
    That is what I am using right now. I would prefer not to have a wrapper notebook, but it works and it is simple. – Emilie Picard-Cantin Nov 25 '21 at 14:26
  • @EmiliePicard-Cantin can you help me out? I have exactly the same problem as you, but when I use "from my_python_file import main" in the wrapper notebook it says "No module named 'my_python_file'". Did you have to do anything special to make this wrapper solution work? – Brendan Hill Dec 01 '21 at 08:33
  • @BrendanHill I have had the same problem. Are your notebook and Python file in the same folder? It worked for me when they were in the exact same folder. Otherwise, I will have to do more digging. – Emilie Picard-Cantin Dec 07 '21 at 12:57

I resolved this by adding the Databricks notebook source marker and command separators to my Python script, so Databricks recognizes it as a Databricks notebook:

# Databricks notebook source

# COMMAND ----------
import pyspark.sql.functions as f

df = spark.createDataFrame([
    (1,2)
], ['test_1', 'test_2'])
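
Once Databricks treats the file as a notebook, it can be scheduled like any other notebook, either from the Jobs UI or through the Jobs API. A rough sketch using the REST API and the requests library (the host, token, cluster ID, and repo path below are placeholders, not values from this thread):

import requests

# Placeholder values; substitute your own workspace URL, token, and cluster ID.
host = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

job_spec = {
    "name": "Run repo file as a notebook",
    "existing_cluster_id": "<cluster-id>",
    "notebook_task": {
        # The path usually omits the .py extension once the file is recognized as a notebook.
        "notebook_path": "/Repos/<user_folder>/<repo_name>/my_python_file"
    }
}

response = requests.post(
    f"{host}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
response.raise_for_status()
print(response.json())  # e.g. {'job_id': 123}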
ARCrow

1- Install databricks-cli by typing pip install databricks-cli in the VS Code terminal.

From https://docs.databricks.com/dev-tools/cli/index.html

2- Upload your Python .py file into Azure Storage mounted on Databricks (check how to mount Azure Storage on Databricks).
3- Connect to Databricks from the CLI by typing the following in the VS Code terminal:
databricks configure --token
It will ask you for the Databricks instance URL and then for a personal access token (you can generate one under Settings in Databricks; check how to generate a token).

4- Create a Databricks job by typing in the terminal: databricks jobs create --json-file create-job.json

Contents of create-job.json

{
  "name": "SparkPi Python job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_F4",
    "num_workers": 2
  },
  "spark_python_task": {
    "python_file": "dbfs:/mnt/xxxxxx/raw/databricks-connectivity-test.py",
    "parameters": [
      "10"
    ]
  }
}
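
For reference, the values under "parameters" are passed to the script as ordinary command-line arguments, and a script run as a spark_python_task has to create its own SparkSession. A hypothetical sketch of what a file like databricks-connectivity-test.py could contain (the actual script is not shown in this answer):

import sys

from pyspark.sql import SparkSession

# "parameters": ["10"] arrives as sys.argv[1].
partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2

# Unlike a notebook, a Python script job does not get a predefined spark object.
spark = SparkSession.builder.getOrCreate()

# Trivial connectivity check: count a small distributed range.
count = spark.range(1000).repartition(partitions).count()
print(f"Counted {count} rows across {partitions} partitions")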

I gathered this information from the YouTube video below: https://www.youtube.com/watch?v=XZFN0hOA8mY&ab_channel=JonWood

Moe
  • 5- Run the job from the Databricks CLI. Just type the following in the VS Code terminal: databricks jobs run-now --job-id 95 – Moe Jan 20 '22 at 02:40
  • the question isn't about a file on DBFS, but about a file in Repos - it's a different thing – Alex Ott Jan 20 '22 at 08:18