I'm trying to use DBUtils from pyspark.dbutils outside Databricks. Copying files locally shows no warning or error, but the files are not present in the target folder.
I can check whether the file exists on DBFS with dbutils.fs.ls, and the file does exist.
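The check is just a listing of the source directory, using the dbutils handle set up further down in this post (a sketch, not verbatim from my code):

# dbutils is constructed later in this post; the source file shows up in this listing
files = dbutils.fs.ls("dbfs:/data/project/")
print([f.name for f in files])  # includes 'my_file.model'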
My PySpark session is configured with databricks-connect and I can run SQL queries through it (see the sanity check after the SparkSession setup below).
This is how I configure databricks-connect to connect to my cluster:
import subprocess

# Connection parameters (values redacted)
DATABRICKS_ADDRESS = "https://xxxxxxxxxxxxxxx.azuredatabricks.net/"
DATABRICKS_API_TOKEN = "xxxxxxxxxxxxxxxxxxxxxxxx"
DATABRICKS_CLUSTER_ID = "0000-0000-0000"
DATABRICKS_ORG_ID = "0000000000000"
DATABRICKS_PORT = "0000"

# Feed the answers to the interactive `databricks-connect configure` prompt via stdin
stdin_list = [DATABRICKS_ADDRESS, DATABRICKS_API_TOKEN, DATABRICKS_CLUSTER_ID, DATABRICKS_ORG_ID, DATABRICKS_PORT]
stdin_string = '\n'.join(stdin_list)
echo = subprocess.Popen(['echo', '-e', stdin_string], stdout=subprocess.PIPE)
output = subprocess.check_output(('databricks-connect', 'configure'), stdin=echo.stdout)
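As an aside, the same answers can be fed without the extra echo process by passing input= directly to subprocess.run (a sketch, assuming the same stdin_string as above):

# Equivalent to the echo pipe above: send the prompt answers straight to stdin
subprocess.run(
    ['databricks-connect', 'configure'],
    input=stdin_string,
    text=True,
    check=True,
)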
Then I get the SparkSession from the Databricks cluster (if it is running; otherwise the call waits for the cluster to wake up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
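At this point remote execution works; for example, a trivial query like the following returns a result from the cluster (a sanity check, not verbatim from my code):

spark.sql("SELECT 1 AS sanity_check").show()
# +------------+
# |sanity_check|
# +------------+
# |           1|
# +------------+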
Now I can configure DBUtils and manipulate files on DBFS:
from pyspark.dbutils import DBUtils

# Build the DBUtils handle from the databricks-connect SparkSession,
# then copy the model from DBFS down to a local folder
dbutils = DBUtils(spark)
dbutils.fs.cp("dbfs:/data/project/my_file.model", "/mnt/c/Users/my_user/project/my_file.model")
22/09/12 12:00:45 WARN SparkServiceRPCClient: Large server response (46597327 bytes compressed)
22/09/12 12:00:45 WARN SparkServiceRPCClient: Large server response (49063049 bytes total)
22/09/12 12:00:45 WARN DBFS: DBFS open on dbfs:/data/project/my_file.model took 6857 ms
22/09/12 12:00:52 WARN DBFS: DBFS create on /mnt/c/Users/my_user/project/my_file.model took 6758 ms
But in my folder at /mnt/c/Users/my_user/project/ there is no file my_file.model, and trying to open it locally fails:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/c/Users/my_user/project/my_file.model'
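For completeness, the error above is raised by an ordinary local read of the target path, along these lines (a hypothetical reproduction, not my exact code):

# Any local read of the target path fails, confirming the copy never landed
with open("/mnt/c/Users/my_user/project/my_file.model", "rb") as f:
    data = f.read()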