
I have a PySpark DataFrame df with columns 'latitude' and 'longitude':

+---------+---------+
| latitude|longitude|
+---------+---------+
|51.822872| 4.905615|
|51.819645| 4.961687|
| 51.81964| 4.961713|
| 51.82256| 4.911187|
|51.819263| 4.904488|
+---------+---------+
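
For reproducibility, df can be built like this (a sketch, assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [(51.822872, 4.905615),
     (51.819645, 4.961687),
     (51.81964, 4.961713),
     (51.82256, 4.911187),
     (51.819263, 4.904488)],
    ["latitude", "longitude"],
)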

I want to get the UTM coordinates ('x' and 'y') from these columns. To do this, I need to feed the 'longitude' and 'latitude' values to the following pyproj function. The resulting 'x' and 'y' should then be appended to the original DataFrame df. This is how I did it in Pandas:

from pyproj import Proj

# UTM zone 31 on the WGS84 ellipsoid; preserve_units=False gives metres
pp = Proj(proj='utm', zone=31, ellps='WGS84', preserve_units=False)
xx, yy = pp(df["longitude"].values, df["latitude"].values)
df["X"] = xx
df["Y"] = yy

How would I do this in Pyspark?


1 Answer


Use pandas_udf: feed the function an array column and return an array as well. See below:

from pyspark.sql.functions import array, pandas_udf, PandasUDFType
from pyproj import Proj
from pandas import Series

# Scalar pandas_udf: takes a Series of [longitude, latitude] arrays and
# returns a Series of (x, y) pairs, exposed to Spark as array<double>
@pandas_udf('array<double>', PandasUDFType.SCALAR)
def get_utm(x):
    pp = Proj(proj='utm', zone=31, ellps='WGS84', preserve_units=False)
    return Series([pp(e[0], e[1]) for e in x])

df.withColumn('utm', get_utm(array('longitude', 'latitude'))) \
  .selectExpr("*", "utm[0] as X", "utm[1] as Y") \
  .show()
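
Note: PandasUDFType.SCALAR is the Spark 2.x style and is deprecated in Spark 3.x. On Spark 3.x the same UDF would be written with type hints; a sketch of the equivalent form:

import pandas as pd
from pyspark.sql.functions import array, pandas_udf
from pyproj import Proj

@pandas_udf('array<double>')
def get_utm(x: pd.Series) -> pd.Series:
    # Each element of x is a [longitude, latitude] pair
    pp = Proj(proj='utm', zone=31, ellps='WGS84', preserve_units=False)
    return pd.Series([list(pp(e[0], e[1])) for e in x])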
  • I got a variety of errors running `df.withColumn('utm', get_utm(array('longitude','latitude'))).selectExpr("*", "utm[0] as X", "utm[1] as Y")`, such as "Python worker failed to connect back" and `java.net.SocketTimeoutException: Accept timed out`. Is there another way, or are these errors worth solving? – Jeroen May 27 '20 at 19:01
  • @Jeroen, there is a known issue regarding pandas_udf and the pyarrow version, see link: https://stackoverflow.com/questions/61202005/pyspark-2-4-5-illegalargumentexception-when-using-pandasudf. If it's not the same issue, can you post the error? – jxc May 27 '20 at 19:08
  • I checked my versions: PySpark is v2.4.5 and pyarrow was 0.13. According to the link I had to downgrade to version 0.10, which I did; `conda list` shows version 0.10 for pyarrow. But now I get the error: `ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.` – Jeroen May 27 '20 at 19:47
  • Can you give me your versions of PyArrow and PySpark? – Jeroen May 27 '20 at 20:01
  • I accepted the answer, but still haven't solved the PyArrow problem. I will look into it more this weekend. – Jeroen May 29 '20 at 06:06
  • Still haven't solved the problem. Do you get the same when typing `df.printSchema()`: `|-- latitude: float (nullable = true) |-- longitude: float (nullable = true) |-- utm: array (nullable = true) | |-- element: double (containsNull = true) |-- X: double (nullable = true) |-- Y: double (nullable = true)`? – Jeroen Jun 02 '20 at 20:13
  • @Jeroen, yes, the schema is about the same, except latitude and longitude are both double. What is the error you get now? BTW, I tested the code on my laptop; in a cluster environment, you will need to make sure all necessary modules (Proj) are available on all workers. – jxc Jun 02 '20 at 20:23
  • I just installed pyspark, pandas, pyproj, pyarrow, and numpy in a virtual environment and ran the script from there. I don't know how to set up a cluster environment that distributes the necessary modules to all workers; is there a clear explanation of how to do so? I can share my actual script this weekend to see if it works on your computer. – Jeroen Jun 04 '20 at 06:59
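
Given the pyarrow version problems discussed in the comments above, one workaround worth noting is a plain Python UDF, which avoids Arrow entirely at the cost of per-row overhead; a sketch (the name get_utm_plain is illustrative, not from the original answer):

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, DoubleType
from pyproj import Proj

# A struct return type lets X and Y be pulled out as separate columns
utm_schema = StructType([
    StructField("X", DoubleType()),
    StructField("Y", DoubleType()),
])

@udf(utm_schema)
def get_utm_plain(lon, lat):
    # Constructing Proj per row is slow, but fine for a small DataFrame
    pp = Proj(proj='utm', zone=31, ellps='WGS84', preserve_units=False)
    x, y = pp(lon, lat)
    return float(x), float(y)

df.withColumn('utm', get_utm_plain('longitude', 'latitude')) \
  .selectExpr("*", "utm.X as X", "utm.Y as Y") \
  .show()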