Is it possible to use pathlib.Path objects with spark.read.parquet and other pyspark.sql.DataFrameReader methods?

It doesn't work by default:

>>> from pathlib import Path
>>> basedir = Path("/data")
>>> spark.read.parquet(basedir / "name.parquet")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-cec8ced1bc5d> in <module>
----> 1 spark.read.parquet(basedir / "name.parquet")

<... a long traceback ...>

/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_command_part(parameter, python_proxy_pool)
    296             command_part += ";" + interface
    297     else:
--> 298         command_part = REFERENCE_TYPE + parameter._get_object_id()
    299 
    300     command_part += "\n"

AttributeError: 'PosixPath' object has no attribute '_get_object_id'
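
Explicit conversion works, of course, but I'd like to avoid doing it at every call site:

>>> spark.read.parquet(str(basedir / "name.parquet"))  # fine: a plain str is accepted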

So I tried to write a py4j input converter:

from pathlib import Path

from py4j.java_gateway import JavaClass
from py4j.protocol import register_input_converter


class PathConverter(object):
    def can_convert(self, object):
        # Claim any pathlib.Path instance
        return isinstance(object, Path)

    def convert(self, object, gateway_client):
        # Build a java.lang.String on the JVM side from the path's text
        JavaString = JavaClass("java.lang.String", gateway_client)
        return JavaString(str(object))


register_input_converter(PathConverter())

But it looks like I've misunderstood some specifics of py4j's string conversion, because jvm.java.lang.String("string") in py4j returns a Python str object:

>>> spark.read.parquet(basedir / "name.parquet")
<... a long traceback ...>
/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1306 
   1307         for temp_arg in temp_args:
-> 1308             temp_arg._detach()

AttributeError: 'str' object has no attribute '_detach'
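
That conversion is easy to observe on the gateway itself (using the private spark.sparkContext._gateway handle to reach py4j):

>>> type(spark.sparkContext._gateway.jvm.java.lang.String("string"))
<class 'str'>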

1 Answer

I have only one ugly solution for now, which is to patch PySpark itself:

diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index fa3e829a88..7441a8ba8c 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -298,7 +298,7 @@ class DataFrameReader(OptionUtils):
                        modifiedAfter=modifiedAfter, datetimeRebaseMode=datetimeRebaseMode,
                        int96RebaseMode=int96RebaseMode)
 
-        return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
+        return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths, converter=str)))
 
     def text(self, paths, wholetext=False, lineSep=None, pathGlobFilter=None,
              recursiveFileLookup=None, modifiedBefore=None,

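For context, _to_seq applies the converter element-wise before handing the list over to the JVM, so the one-word change amounts to:

paths = [str(p) for p in paths]  # every PosixPath (or str) becomes a plain string
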
Also, looking through the readwriter.py source code, it feels safe enough to monkeypatch its copy of _to_seq:

from functools import partial
from pathlib import PurePath

from pyspark.sql import readwriter


def converter(x):
    # Coerce any pathlib path to str; pass everything else through untouched
    if isinstance(x, PurePath):
        return str(x)
    return x


# Pre-bind the converter into readwriter's module-level _to_seq
readwriter._to_seq = partial(readwriter._to_seq, converter=converter)
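
A quick sanity check after the patch (assuming a Parquet dataset actually exists at that path):

>>> spark.read.parquet(basedir / "name.parquet")  # no explicit str() needed anymore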

Or maybe a more correct and complete workaround would be to monkeypatch the reader/writer methods directly:

from functools import wraps

from pyspark.sql import readwriter


@wraps(readwriter.DataFrameWriter.parquet)
def parquet(self, path, mode=None, partitionBy=None, compression=None):
    # Delegate to the original method, stringifying the path up front
    return parquet.__wrapped__(self, str(path), mode=mode,
                               partitionBy=partitionBy,
                               compression=compression)


readwriter.DataFrameWriter.parquet = parquet
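
The same trick should extend to the read side (a sketch, assuming the DataFrameReader.parquet(*paths, **options) signature of Spark 3.1; read_parquet is just my name for the wrapper):

@wraps(readwriter.DataFrameReader.parquet)
def read_parquet(self, *paths, **options):
    # Stringify each positional path before delegating to the original
    return read_parquet.__wrapped__(self, *(str(p) for p in paths), **options)

readwriter.DataFrameReader.parquet = read_parquet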