I am inheriting a huge PySpark project, and instead of using the Databricks UI for development I would like to use VSCode via databricks-connect. Because of this, I am failing to determine the best practices for the following:
Because the project files were saved as .py files in Repos, when I open them in VSCode it is not recognising the Databricks magic commands such as %run. So I cannot run any cell that calls another notebook with
%run ./PATH/TO-ANOTHER-FILE
Changing the file to .ipynb, or changing the call to dbutils.notebook.run, will solve the issue, but it will mean changing cells in almost 20 notebooks. Using dbutils also poses the next challenge. Since Databricks creates the Spark session for you behind the scenes, there was no need to use
spark = SparkSession.builder.getOrCreate()
when coding in the Databricks UI. But when using databricks-connect, you have to manually create a SparkSession that connects to the remote cluster. This means that, for me to use dbutils, I will have to do the following:

from pyspark.dbutils import DBUtils
dbutils = DBUtils(spark)
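
Putting this together, my understanding is that every file would need a preamble roughly like the following before any of the existing code runs (a sketch assuming the classic databricks-connect flow, where getOrCreate() picks up the cluster set via databricks-connect configure; I have not confirmed this covers all of dbutils):

from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

# with databricks-connect configured, getOrCreate() returns a session
# bound to the remote cluster instead of a local one
spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)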
Changing the whole code base to fit my preferred development workflow does not seem justifiable. Any pointers on how I can circumvent this?
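
The closest I have come to a workaround is a small shim module that every file would import, so the per-file change stays to one line; a sketch (the module name and fallback behaviour are just illustrative, and I have not verified this against the Databricks runtime):

# databricks_shim.py
from pyspark.sql import SparkSession

# under databricks-connect this attaches to the configured remote cluster;
# inside Databricks it returns the session the platform already created
spark = SparkSession.builder.getOrCreate()

try:
    # pyspark.dbutils exists in the Databricks runtime and the
    # databricks-connect build of pyspark, but not in plain pyspark
    from pyspark.dbutils import DBUtils
    dbutils = DBUtils(spark)
except ImportError:
    dbutils = None  # plain local pyspark, dbutils not available

Each file would then start with from databricks_shim import spark, dbutils. But this still does not address the %run cells, so I am not sure it is the right direction.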