Is it possible to use pathlib.Path objects with spark.read.parquet and other pyspark.sql.DataFrameReader methods?
It doesn't work by default:
>>> from pathlib import Path
>>> basedir = Path("/data")
>>> spark.read.parquet(basedir / "name.parquet")
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-cec8ced1bc5d> in <module>
----> 1 spark.read.parquet(basedir / "name.parquet")
<... a long traceback ...>
/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_command_part(parameter, python_proxy_pool)
296 command_part += ";" + interface
297 else:
--> 298 command_part = REFERENCE_TYPE + parameter._get_object_id()
299
300 command_part += "\n"
AttributeError: 'PosixPath' object has no attribute '_get_object_id'
I tried to write a py4j type converter:
from py4j.java_gateway import JavaClass
from py4j.protocol import register_input_converter

class PathConverter(object):
    def can_convert(self, object):
        return isinstance(object, Path)

    def convert(self, object, gateway_client):
        # Build a java.lang.String from the path's string form
        JavaString = JavaClass("java.lang.String", gateway_client)
        return JavaString(str(object))

register_input_converter(PathConverter())
But it looks like I misunderstood something about how py4j converts strings, because jvm.java.lang.String("string") in py4j returns a Python str object:
>>> spark.read.parquet(basedir / "name.parquet")
<... a long traceback ...>
/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
1306
1307 for temp_arg in temp_args:
-> 1308 temp_arg._detach()
AttributeError: 'str' object has no attribute '_detach'
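
For comparison, here is a minimal sketch of the explicit conversion that sidesteps the converter entirely: turn the path into a plain str before it reaches py4j. The helper name to_spark_path is hypothetical (it is not part of pyspark or py4j), and PurePosixPath is used only so the snippet runs without a Spark session and prints the same thing on any platform.

```python
import os
from pathlib import PurePosixPath


def to_spark_path(path):
    """Hypothetical helper: return a plain str for a str or os.PathLike input."""
    # os.fspath() calls __fspath__ on path-like objects and passes str through.
    return os.fspath(path)


basedir = PurePosixPath("/data")
print(to_spark_path(basedir / "name.parquet"))  # -> /data/name.parquet

# With a live session this would be used as (assumption, not run here):
# spark.read.parquet(to_spark_path(basedir / "name.parquet"))
```

This only works at call sites that are wrapped explicitly, so it does not answer whether the conversion can be made automatic through a registered py4j input converter.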