
I am trying to run some code that I did not write. Its description says it is a speedy way to convert a Spark DataFrame to a Pandas DataFrame, and that it was borrowed from here.


from typing import Optional

import pandas as pd
import pyspark.sql

def to_pandas(df: pyspark.sql.DataFrame, n_partitions: Optional[int] = None) -> pd.DataFrame:
    """
    :param df: Spark DataFrame to be transformed to a Pandas DataFrame
    :param n_partitions: The target number of partitions
    :return: The Pandas DataFrame version of the provided ``df``
    """
    if n_partitions is not None:
        df = df.repartition(n_partitions)

    # Turn each partition into a Pandas DataFrame on the executors,
    # collect them on the driver, and concatenate into one DataFrame.
    df_pandas = df.rdd.mapPartitions(lambda rows: [pd.DataFrame(list(rows))]).collect()
    df_pandas = pd.concat(df_pandas, ignore_index=True)
    df_pandas.columns = df.columns

    return df_pandas
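
For reference, here is a minimal sketch of how I invoke it (the SparkSession `spark` and the example DataFrame are placeholders, not my actual data):

# Placeholder invocation: `spark` is an active SparkSession and the
# range DataFrame stands in for my real data.
spark_df = spark.range(1_000_000)
pdf = to_pandas(spark_df, n_partitions=8)  # the collect() inside raises the error below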

However, running this gives a Py4J error as follows:

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.3.3\plugins\python-ce\helpers\pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\envs\dbconnect\lib\site-packages\pyspark\rdd.py", line 967, in collect
    sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\ProgramData\Anaconda3\envs\dbconnect\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "C:\ProgramData\Anaconda3\envs\dbconnect\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
    return f(*a, **kw)
  File "C:\ProgramData\Anaconda3\envs\dbconnect\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.StreamCorruptedException: invalid type code: 00
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1698)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
    at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)
    at sun.reflect.GeneratedMethodAccessor313.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2296)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
    at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)
    at sun.reflect.GeneratedMethodAccessor313.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2296)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
    at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)
    at sun.reflect.GeneratedMethodAccessor313.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2296)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
    at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)
    at sun.reflect.GeneratedMethodAccessor313.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2296)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2093)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1655)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
    at org.apache.spark.sql.util.ProtoSerializer.$anonfun$deserializeObject$1(ProtoSerializer.scala:6631)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at org.apache.spark.sql.util.ProtoSerializer.deserializeObject(ProtoSerializer.scala:6616)
    at com.databricks.service.SparkServiceRPCHandler.execute0(SparkServiceRPCHandler.scala:728)
    at com.databricks.service.SparkServiceRPCHandler.$anonfun$executeRPC0$1(SparkServiceRPCHandler.scala:477)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.service.SparkServiceRPCHandler.executeRPC0(SparkServiceRPCHandler.scala:372)
    at com.databricks.service.SparkServiceRPCHandler$$anon$2.call(SparkServiceRPCHandler.scala:323)
    at com.databricks.service.SparkServiceRPCHandler$$anon$2.call(SparkServiceRPCHandler.scala:309)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.databricks.service.SparkServiceRPCHandler.$anonfun$executeRPC$1(SparkServiceRPCHandler.scala:359)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.service.SparkServiceRPCHandler.executeRPC(SparkServiceRPCHandler.scala:336)
    at com.databricks.service.SparkServiceRPCServlet.doPost(SparkServiceRPCServer.scala:167)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:550)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.eclipse.jetty.server.Server.handle(Server.java:516)
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
    at java.lang.Thread.run(Thread.java:748)

What does this mean and how can it be circumvented? Thanks


2 Answers


I was getting the exact same "invalid type code: 00" error from PySpark. I was using databricks-connect 9.1.10, which turned out to be the culprit; updating to the latest 9.1.* release (9.1.21) fixed the issue. Give that a try.


I know the problem in the question is the Py4JJavaError, but the underlying need for this approach seems to be Spark-to-Pandas DataFrame conversion.

In my experience, this type of conversion is sensitive to the volume of the data. Also, pd.concat() may raise an out-of-memory error if the data size exceeds what the driver can handle.

You should try multiple approaches and do a volume-versus-time comparison for each method.
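
For example, one baseline worth including in that comparison is Spark's built-in Arrow-accelerated toPandas() (a sketch; it assumes PyArrow is installed on the cluster, and like the function in the question it still collects the full dataset to the driver):

# Sketch: enable Arrow-based columnar transfer, then use the built-in conversion.
# One candidate for the volume-versus-time comparison; requires PyArrow.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pandas_df = spark_df.toPandas()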

The method I follow is:

  • Export from Spark as Parquet (CSV accuracy depends on proper handling of commas in the dataset):
spark_df.repartition(1).write.mode("overwrite").parquet("dbfs:/my_dir")
  • and then import into Pandas:
pandas_df = pd.read_parquet("/dbfs/my_dir/part-0xxxxxxx.parquet")

The exported Parquet filename is based on the partition and a timestamp, so I've written the following utility function to make life easier:

def export_spark_df_to_parquet(df, dir_dbfs_path, parquet_file_name):
    tmp_parquet_dir_dbfs_path = dir_dbfs_path + "/tmp"
    parquet_file_dbfs_path = dir_dbfs_path + "/" + parquet_file_name

    # Export the dataframe as a single-partition Parquet directory
    print("Exporting dataframe to Parquet...")
    df.repartition(1).write.mode("overwrite").parquet(tmp_parquet_dir_dbfs_path)

    # Copy the single part file to a fixed, predictable name
    for _file in dbutils.fs.ls(tmp_parquet_dir_dbfs_path):
        if _file.name.endswith(".parquet"):
            print(f"Copying file {_file.path} to {parquet_file_dbfs_path}")
            dbutils.fs.cp(_file.path, parquet_file_dbfs_path)
            break

and use it as:

export_spark_df_to_parquet(spark_df, "dbfs:/my_dir", "my_file.parquet")

This answer does not directly address your problem, but it may work for your DataFrame conversion need.

Now, I can import from a fixed Parquet file:

pandas_df = pd.read_parquet("/dbfs/my_dir/my_file.parquet")
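
As an aside, if the pyarrow engine is available, pd.read_parquet can also be pointed at the directory itself, which makes the fixed-filename copy optional (a sketch under that assumption):

# Sketch: with the pyarrow engine, pandas reads every part file in the
# directory, so no single part file needs to be singled out.
pandas_df = pd.read_parquet("/dbfs/my_dir", engine="pyarrow")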