
I have been trying to convert a GeoPandas DataFrame to a PySpark DataFrame without success. Currently, I have extended the DataFrame class to convert a GeoPandas DataFrame to a Spark DataFrame as follows:

!pip install geospark  # install GeoSpark before importing its types (Colab)

from pyspark.sql import DataFrame
from pyspark.sql.types import IntegerType, StringType, FloatType, BooleanType, DateType, TimestampType, StructField, StructType
from geospark.sql.types import GeometryType

class SPandas(DataFrame):
  def __init__(self, sqlC, objgpd):
    # Map each pandas/GeoPandas dtype to its Spark SQL equivalent
    esquema = dict(objgpd.dtypes)
    equivalencias = {'int64' : IntegerType, 'object' : StringType, 'float64' : FloatType,
                     'bool' : BooleanType, 'datetime64' : DateType,
                     'timedelta' : TimestampType, 'geometry' : GeometryType}

    for clave, valor in esquema.items():
      try:
        esquema[clave] = equivalencias[str(valor)]
      except KeyError:
        # Fall back to strings for any dtype without a mapping
        esquema[clave] = StringType

    # Build the Spark schema and create the DataFrame from the GeoDataFrame
    esquema = StructType([ StructField(v, esquema[v](), False) for v in esquema.keys() ])
    datos = sqlC.createDataFrame(objgpd, schema=esquema)
    super(SPandas, self).__init__(datos._jdf, datos.sql_ctx)

The preceding code runs without error, but when I try to take() an item from the DataFrame I get the following error:

import geopandas as gpd

fp = "Paralela/Barrios/Barrios.shp"
map_df = gpd.read_file(fp)
mapa_sp = SPandas(sqlC, map_df)  # sqlC is the existing SQLContext
mapa_sp.take(1)

Py4JJavaError: An error occurred while calling o21.applySchemaToPythonRDD.
: java.lang.ClassNotFoundException: org.apache.spark.sql.geosparksql.UDT.GeometryUDT

The problem is with the 'geometry' column of the GeoPandas DataFrame, as everything works flawlessly without it. The 'geometry' column holds Shapely Polygon objects, which should be recognized by GeoSpark's GeometryType class.

Is there any way to install org.apache.spark.sql.geosparksql.UDT.GeometryUDT? I'm using Google Colab.


1 Answer


You need to include the geospark dependency in your project and add the jar to your runtime environment's classpath. The version of the jar below is compatible with spark-core_2.11:2.3.0:

<dependency>
    <groupId>org.datasyslab</groupId>
    <artifactId>geospark</artifactId>
    <version>1.3.1</version>
    <scope>provided</scope>
</dependency>
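
In a pure PySpark environment such as Colab there is no Maven build, so the jar has to reach the JVM classpath another way. A minimal sketch, assuming the geospark 1.3.1 Python package and its documented upload_jars and GeoSparkRegistrator helpers (verify the names against your installed version):

from pyspark.sql import SparkSession
from geospark.register import upload_jars, GeoSparkRegistrator

# Copy the GeoSpark jars bundled with the Python package onto the
# Spark classpath (uses findspark under the hood); must run before
# the SparkSession is created
upload_jars()

spark = SparkSession.builder.appName("geo").getOrCreate()

# Register GeometryUDT and the GeoSpark SQL functions with this session;
# without the jars on the classpath Spark cannot resolve
# org.apache.spark.sql.geosparksql.UDT.GeometryUDT
GeoSparkRegistrator.registerAll(spark)

Alternatively, the jar can be passed through spark-submit arguments before PySpark starts, as noted in the comments below.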
  • The issue now is that the following error is raised: ValueError: field geometry: is not an instance of type GeometryType. That is odd, because GeometryType does support Shapely Polygons. Do you happen to know anything about this, or should I open a new thread? – minimino Jun 15 '20 at 14:18
  • For anyone wondering how to install this in Google Colab, just add the following line BEFORE installing PySpark: os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /geospark-1.3.1.jar pyspark-shell' – minimino Jun 15 '20 at 14:20
  • Ohh, those are spark-submit params @minimino. Thank you for the info. – QuickSilver Jun 15 '20 at 14:20