I have some PySpark code that I package as a library so that it can be pip installed and used in other projects. The code loads a parquet file that I bundle with the library. This works fine in most environments, but it doesn't work on Databricks.
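For concreteness, the loading logic inside the package looks roughly like this (a minimal sketch; the helper name is hypothetical, and the package/directory names mirror the ones below):

```python
# Inside my_package: resolve the bundled parquet directory relative to the
# installed package and read it with Spark. Helper name is hypothetical.
import os

from pyspark.sql import SparkSession, DataFrame


def load_bundled_parquet(spark: SparkSession) -> DataFrame:
    # Ends up as something like .../site-packages/my_package/my_parquet_dir
    parquet_dir = os.path.join(os.path.dirname(__file__), "my_parquet_dir")
    return spark.read.parquet(parquet_dir)
```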
After pip installing on Databricks I can see the files at file:/databricks/python/lib/python3.7/site-packages/my_package/my_parquet_dir, but the parquet load call doesn't work.
If I just let it try to load from /databricks/python/lib/python3.7/site-packages/my_package/my_parquet_dir, it doesn't find the directory at all.
If I load from file:/databricks/python/lib/python3.7/site-packages/my_package/my_parquet_dir, it finds the directory but acts as if the directory is empty. It almost seems like the parquet load can recognize the top-level directory (as long as I prepend "file:" to my path), but the subsequent calls by the loader to read the individual files fail because it isn't prepending "file:".
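The two attempts look roughly like this from a Databricks notebook (assuming the notebook's built-in `spark` session; this just shows what I observe, not a working solution):

```python
# Path to the parquet directory installed with the package on the cluster.
path = "/databricks/python/lib/python3.7/site-packages/my_package/my_parquet_dir"

spark.read.parquet(path)            # fails: directory not found
spark.read.parquet("file:" + path)  # finds the directory but reads no files
```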
...I'm just hoping someone has experience accessing data from file:/databricks and knows some sort of trick.