
I'm trying to use DBUtils from pyspark.dbutils outside Databricks. It shows no warning or error when copying files locally, but the files are not present in the target folder.

I can check whether the file exists on DBFS with dbutils.fs.ls, and the file does exist.

My pyspark session is configured with databricks-connect, and I can run SQL with it.

This is how I configure databricks-connect to connect to my cluster:

import subprocess

DATABRICKS_ADDRESS = "https://xxxxxxxxxxxxxxx.azuredatabricks.net/"
DATABRICKS_API_TOKEN = "xxxxxxxxxxxxxxxxxxxxxxxx"
DATABRICKS_CLUSTER_ID = "0000-0000-0000"
DATABRICKS_ORG_ID = "0000000000000"
DATABRICKS_PORT = "0000"

# Answer the interactive prompts of `databricks-connect configure`, one value per line.
stdin_list = [DATABRICKS_ADDRESS, DATABRICKS_API_TOKEN, DATABRICKS_CLUSTER_ID, DATABRICKS_ORG_ID, DATABRICKS_PORT]
stdin_string = '\n'.join(stdin_list) + '\n'
output = subprocess.check_output(('databricks-connect', 'configure'), input=stdin_string, text=True)

Then I get the SparkSession from the databricks cluster (if it is running; otherwise it waits for the cluster to wake up):

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Now I can configure DBUtils and manipulate files on DBFS:

from pyspark.dbutils import DBUtils

dbutils = DBUtils(spark)
# Copy the model from DBFS to my local folder (or so I expect):
dbutils.fs.cp("dbfs:/data/project/my_file.model", "/mnt/c/Users/my_user/project/my_file.model")


22/09/12 12:00:45 WARN SparkServiceRPCClient: Large server response (46597327 bytes compressed)
22/09/12 12:00:45 WARN SparkServiceRPCClient: Large server response (49063049 bytes total)
22/09/12 12:00:45 WARN DBFS: DBFS open on dbfs:/data/project/my_file.model took 6857 ms
22/09/12 12:00:52 WARN DBFS: DBFS create on /mnt/c/Users/my_user/project/my_file.model took 6758 ms

But in my folder at /mnt/c/Users/my_user/project/ there is no my_file.model, and trying to open it fails:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/c/Users/my_user/project/my_file.model'

1 Answer


I found out that I was actually moving the data inside DBFS, so the behaviour is normal. I had thought the tool was made to interact between local and remote files.
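You can verify this from the same session: the destination path has no scheme, so it is resolved against dbfs:/ and the file sits on the remote filesystem. A quick check, reusing the paths from the question:

# The copy did succeed, just inside DBFS: a destination without a scheme
# is treated as a DBFS path, so the file landed on the remote filesystem.
dbutils.fs.ls("dbfs:/mnt/c/Users/my_user/project/")
# [FileInfo(path='dbfs:/mnt/c/Users/my_user/project/my_file.model', ...)]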

What I needed is the DBFS CLI (for example databricks fs cp dbfs:/data/project/my_file.model /mnt/c/Users/my_user/project/) or the internal pyspark functions, which work with any distributed file system (DBFS/HDFS).

If the links die, you can search for "JVM gateway" or "copyToLocalFile pyspark".
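For reference, this is roughly what the JVM gateway route looks like. It is a minimal sketch using the standard Hadoop FileSystem API with the paths from the question; I have not verified it against every databricks-connect version:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reach the Hadoop FileSystem API through the session's Py4J JVM gateway.
Path = spark._jvm.org.apache.hadoop.fs.Path
hadoop_conf = spark._jsc.hadoopConfiguration()

src = Path("dbfs:/data/project/my_file.model")
dst = Path("/mnt/c/Users/my_user/project/my_file.model")

# copyToLocalFile(delSrc, src, dst) writes to the filesystem of the machine
# running this process, unlike dbutils.fs.cp, which stayed inside DBFS.
fs = src.getFileSystem(hadoop_conf)
fs.copyToLocalFile(False, src, dst)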
