
I'm trying to use DBUtils from pyspark.dbutils outside Databricks. It shows no warning or error when copying files locally, but the files are not present in the target folder.

I can check whether the file exists on DBFS with dbutils.fs.ls, and the file does exist.

My pyspark session is configured with databricks-connect, and I can run SQL with it.

This is how I configure databricks-connect to connect to my cluster:

import subprocess

DATABRICKS_ADDRESS = "https://xxxxxxxxxxxxxxx.azuredatabricks.net/"
DATABRICKS_API_TOKEN = "xxxxxxxxxxxxxxxxxxxxxxxx"
DATABRICKS_CLUSTER_ID = "0000-0000-0000"
DATABRICKS_ORG_ID = "0000000000000"
DATABRICKS_PORT = "0000"

# Answer the interactive prompts of `databricks-connect configure`, one value per line.
stdin_list = [DATABRICKS_ADDRESS, DATABRICKS_API_TOKEN, DATABRICKS_CLUSTER_ID, DATABRICKS_ORG_ID, DATABRICKS_PORT]
stdin_string = '\n'.join(stdin_list) + '\n'
output = subprocess.check_output(('databricks-connect', 'configure'), input=stdin_string, text=True)

Then I get the SparkSession from the databricks cluster (if it is running; otherwise it waits for the cluster to wake up):

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Now I can configure DBUtils and manipulate files on DBFS:

from pyspark.dbutils import DBUtils

dbutils = DBUtils(spark)
# Copy the model from DBFS to my local folder (or so I expect):
dbutils.fs.cp("dbfs:/data/project/my_file.model", "/mnt/c/Users/my_user/project/my_file.model")


22/09/12 12:00:45 WARN SparkServiceRPCClient: Large server response (46597327 bytes compressed)
22/09/12 12:00:45 WARN SparkServiceRPCClient: Large server response (49063049 bytes total)
22/09/12 12:00:45 WARN DBFS: DBFS open on dbfs:/data/project/my_file.model took 6857 ms
22/09/12 12:00:52 WARN DBFS: DBFS create on /mnt/c/Users/my_user/project/my_file.model took 6758 ms

But in my folder at /mnt/c/Users/my_user/project/ there is no my_file.model, and trying to open it fails:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/c/Users/my_user/project/my_file.model'

1 Answer


I found out that I was actually moving the data inside DBFS, so the behaviour is normal. I had thought the tool was made to interact between local and remote files.
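You can verify this from the same session: the destination path has no scheme, so it is resolved against dbfs:/ and the file sits on the remote filesystem. A quick check, reusing the paths from the question:

# The copy did succeed, just inside DBFS: a destination without a scheme
# is treated as a DBFS path, so the file landed on the remote filesystem.
dbutils.fs.ls("dbfs:/mnt/c/Users/my_user/project/")
# [FileInfo(path='dbfs:/mnt/c/Users/my_user/project/my_file.model', ...)]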

What I needed is the DBFS CLI (for example databricks fs cp dbfs:/data/project/my_file.model /mnt/c/Users/my_user/project/) or the internal pyspark functions, which work with any distributed file system (DBFS/HDFS).

If the links die, you can search for "JVM gateway" or "copyToLocalFile pyspark".
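For reference, this is roughly what the JVM gateway route looks like. It is a minimal sketch using the standard Hadoop FileSystem API with the paths from the question; I have not verified it against every databricks-connect version:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reach the Hadoop FileSystem API through the session's Py4J JVM gateway.
Path = spark._jvm.org.apache.hadoop.fs.Path
hadoop_conf = spark._jsc.hadoopConfiguration()

src = Path("dbfs:/data/project/my_file.model")
dst = Path("/mnt/c/Users/my_user/project/my_file.model")

# copyToLocalFile(delSrc, src, dst) writes to the filesystem of the machine
# running this process, unlike dbutils.fs.cp, which stayed inside DBFS.
fs = src.getFileSystem(hadoop_conf)
fs.copyToLocalFile(False, src, dst)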
