
We have a PySpark dataframe like this:

df = spark.createDataFrame(
    [
        (['target'], [2], [2], [3], [3]),
        (['NJ'], [3], [3], [4], [4]),
        (['target', 'target'], [4, 5], [4, 5], [6, 7], [6, 7]),
        (['CA'], [5], [5], [6], [6]),
    ],
    ('group_name', 'long', 'lat', 'com_long', 'com_lat'),
)

Schema
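For reference, df.printSchema() on this data should print roughly the following (sketched by hand, not captured output):

df.printSchema()
# root
#  |-- group_name: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- long: array (nullable = true)
#  |    |-- element: long (containsNull = true)
#  |-- lat: array (nullable = true)
#  |    |-- element: long (containsNull = true)
#  |-- com_long: array (nullable = true)
#  |    |-- element: long (containsNull = true)
#  |-- com_lat: array (nullable = true)
#  |    |-- element: long (containsNull = true)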

We want to extract the data at the position of 'target' and use it to perform a distance calculation with a UDF.

First we want to get the index of the target position in the group_name column.

df = df.withColumn("target-1a-idx", (F.array_position(df.group_name, "target") -1 )) 
df = df.withColumn("target-1a-idx",F.when(F.col("target-1a-idx")!=-1,F.col("target-1a-idx")))

Now we create the helper columns with the target index.

prefix = "target_"  # helper columns: target_long, target_lat, target_com_long, target_com_lat
columns = ['long', 'lat', 'com_long', 'com_lat']
for col in columns:
    df = df.withColumn(
        prefix + col, F.col(col)[F.col("target-1a-idx")])

DF with helper columns
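
For the sample data, the helper columns should contain roughly the following (sketched by hand; rows without 'target' get nulls because indexing with a null index yields null):

df.select("target-1a-idx", "target_long", "target_lat", "target_com_long", "target_com_lat").show()
# +-------------+-----------+----------+---------------+--------------+
# |target-1a-idx|target_long|target_lat|target_com_long|target_com_lat|
# +-------------+-----------+----------+---------------+--------------+
# |            0|          2|         2|              3|             3|
# |         null|       null|      null|           null|          null|
# |            0|          4|         4|              6|             6|
# |         null|       null|      null|           null|          null|
# +-------------+-----------+----------+---------------+--------------+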

Filtering the Null values is optional.

df_filtered = df.filter(F.col("target-1a-idx").isNotNull())

Finally, we define a UDF to calculate the distance and call it:

from pyspark.sql import types as T
from geopy.distance import geodesic

@F.udf(returnType=T.FloatType())
def geodesic_udf(a, b):
    # Fall back to 1.0 when either coordinate pair is missing.
    if a is None or b is None:
        return 1.0
    return geodesic(a, b).meters

df_filtered = df_filtered.withColumn(
    "distance_to_station",
    geodesic_udf(
        F.array("target_long", "target_lat"),
        F.array(
            "target_com_long",
            "target_com_lat",
        ),
    ),
)


ERROR MESSAGE

While we are absolutely sure that we installed and imported geopy and geodesic correctly, we received a ModuleNotFoundError. We suspect the problem is not actually with the module. This is the error message:

ModuleNotFoundError: No module named 'geopy'

Could you help us with this? Thank you!

We already checked the imports and the installed packages (pip list), and we filtered out the nulls.


1 Answer


The problem is with the worker nodes: the library is not installed on them. A UDF does not use native Spark logic but plain Python, so the library would need to be present on every node.

-> If possible, do not use a UDF but a native PySpark/Spark function.

def calc_distance(df, suffix, lat1, lat2, lon1, lon2):
    """Haversine formula: distance between two GPS coordinates, added to the Spark dataframe as a new column."""
    # Intermediate haversine term: sin^2(dlat/2) + cos(lat1) * cos(lat2) * sin^2(dlon/2)
    df = df.withColumn('haversine_d{sf}'.format(sf=suffix), (F.pow(F.sin(F.radians(F.col(lat2) - F.col(lat1)) / 2), 2) +
                                           F.cos(F.radians(F.col(lat1))) * F.cos(F.radians(F.col(lat2))) *
                                           F.pow(F.sin(F.radians(F.col(lon2) - F.col(lon1)) / 2), 2)))
    # Distance in metres: 2 * R * atan2(sqrt(h), sqrt(1 - h)), with 2 * R = 12742000 m (Earth's diameter)
    df = df.withColumn('distance_in_m{sf}'.format(sf=suffix), F.atan2(F.sqrt(F.col('haversine_d{sf}'.format(sf=suffix))), F.sqrt(-F.col('haversine_d{sf}'.format(sf=suffix)) + 1)) * 12742000)
    df = df.drop('haversine_d{sf}'.format(sf=suffix))
    return df
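
Applied to the helper columns from the question (column names taken from the code above), the call would look roughly like this:

# Assumed usage with the helper columns created earlier; the suffix only names the output column.
df_filtered = calc_distance(df_filtered, suffix='_to_station',
                            lat1='target_lat', lat2='target_com_lat',
                            lon1='target_long', lon2='target_com_long')
# Adds 'distance_in_m_to_station', computed with native Spark expressions,
# so no extra Python package is needed on the worker nodes.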

or

-> Alternatively, make the Python environment (including geopy) available on each node. Description here
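
A minimal sketch of that approach, assuming Spark 3.1+ and a conda environment packed with conda-pack (file name and environment contents are assumptions; geopy must be inside the packed environment):

import os
from pyspark.sql import SparkSession

# Workers should use the Python interpreter from the shipped archive;
# './environment' is the name after the '#' in spark.archives.
os.environ['PYSPARK_PYTHON'] = './environment/bin/python'

spark = (
    SparkSession.builder
    # Ship the packed conda environment (e.g. created with `conda pack -o pyspark_conda_env.tar.gz`)
    # to every executor so the UDF can import geopy there.
    .config('spark.archives', 'pyspark_conda_env.tar.gz#environment')
    .getOrCreate()
)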


p.s. I am part of the team asking the question.
