I have some PySpark code that I package as a library so that it can be pip installed and used in other projects. The code loads a parquet file that I include with the library. This works fine in most environments, but it doesn't work on Databricks.

After pip installing on Databricks I can see the files at file:/databricks/python/lib/python3.7/site-packages/my_package/my_parquet_dir, but the call to load the parquet file doesn't work.

If I just let it try to load from /databricks/python/lib/python3.7/site-packages/my_package/my_parquet_dir, it doesn't find the directory at all.

If I load from file:/databricks/python/lib/python3.7/site-packages/my_package/my_parquet_dir, it finds the directory but acts as if the directory is empty. It almost seems like the parquet load recognizes the top-level directory (as long as I prepend "file:" to the path), but subsequent calls by the loader to read individual files fail because it isn't prepending "file:".

...I'm just hoping someone has experience accessing data from file:/databricks and knows some sort of trick.
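Roughly, the load looks like this (a simplified sketch; spark is the SparkSession Databricks provides, and my_package / my_parquet_dir stand in for the real names):

    import os
    import my_package  # the pip-installed library that bundles the parquet directory

    # Resolve the bundled parquet directory relative to the installed package
    parquet_dir = os.path.join(os.path.dirname(my_package.__file__), "my_parquet_dir")

    # Without a scheme, Spark resolves the path against DBFS and doesn't find it
    df = spark.read.parquet(parquet_dir)

    # With "file:" the directory is found, but it's treated as empty
    df = spark.read.parquet("file:" + parquet_dir)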


1 Answer

It turns out that prepending "file:" was indeed the key; the issue was that in one spot I had misspelled it as "File:".
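For anyone hitting the same thing, the working call ended up being just this (parquet_dir being the absolute site-packages path from the question):

    # The scheme must be lowercase "file:", not "File:"
    df = spark.read.parquet("file:" + parquet_dir)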

  • Hi @Mike, how did you prepend "file:" above? I have a similar issue here: https://stackoverflow.com/questions/68202341/analysisexception-path-does-not-exist-dbfs-databricks-python-lib-python3-7-si Any idea? – user3868051 Jun 30 '21 at 23:29