We have a PySpark DataFrame like:
df = spark.createDataFrame(
    [
        (['target'], [2], [2], [3], [3]),
        (['NJ'], [3], [3], [4], [4]),
        (['target', 'target'], [4, 5], [4, 5], [6, 7], [6, 7]),
        (['CA'], [5], [5], [6], [6]),
    ],
    ('group_name', 'long', 'lat', 'com_long', 'com_lat'),
)
We want to extract the data at the position of "target" and use it to perform a distance calculation with a UDF.
First, we get the index of the target position in the group_name column:
from pyspark.sql import functions as F
from pyspark.sql import types as T

df = df.withColumn("target-1a-idx", F.array_position(df.group_name, "target") - 1)
# array_position is 1-based and returns 0 when the value is absent, so after
# subtracting 1 a missing target shows up as -1; replace that sentinel with null
df = df.withColumn("target-1a-idx", F.when(F.col("target-1a-idx") != -1, F.col("target-1a-idx")))
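As a side note, Spark's array_position is 1-based and returns 0 when the value is not found, which is why the -1 offset and the when guard turn "not found" into null. A pure-Python sketch of the same semantics (the helper names here are ours, for illustration only):

```python
def array_position(arr, value):
    """Mimic Spark's 1-based array_position: index of first match, 0 if absent."""
    for i, item in enumerate(arr):
        if item == value:
            return i + 1  # Spark indexes from 1
    return 0  # Spark's "not found" sentinel

def target_index(arr):
    """Convert to a 0-based index the same way the withColumn expressions do."""
    idx = array_position(arr, "target") - 1
    return idx if idx != -1 else None  # None plays the role of SQL null
```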
Now we create the helper columns with the target index.
prefix = "target_"
columns = ['long', 'lat', 'com_long', 'com_lat']
for col in columns:
    # pick the element at the target index out of each array column
    df = df.withColumn(prefix + col, F.col(col)[F.col("target-1a-idx")])
Filtering out the null values is optional:
df_filtered = df.filter(F.col("target-1a-idx").isNotNull())
Finally, we define a UDF to calculate the distance and call it:
import geopy
from geopy.distance import geodesic

@F.udf(returnType=T.FloatType())
def geodesic_udf(a, b):
    if a is None or b is None:
        return 1.0
    return geodesic(a, b).meters
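For intuition about what geodesic computes, here is a rough pure-Python haversine (a spherical approximation; geopy's geodesic uses an ellipsoid model and will differ slightly). It takes (latitude, longitude) pairs, which is also the ordering geopy's geodesic expects:

```python
import math

def haversine_meters(a, b):
    """Great-circle distance in meters between (lat, lon) pairs on a spherical Earth."""
    r = 6371008.8  # mean Earth radius in meters
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))
```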
df_filtered = df_filtered.withColumn(
    "distance_to_station",
    geodesic_udf(
        # geopy's geodesic expects (latitude, longitude) ordering
        F.array("target_lat", "target_long"),
        F.array("target_com_lat", "target_com_long"),
    ),
)
ERROR MESSAGE
While we are absolutely sure that we installed and imported geopy and geodesic correctly, we received a ModuleNotFoundError. We suspect the problem is actually not with the module itself.
ModuleNotFoundError: No module named 'geopy'
Could you help us find the answer? Thank you!
What we have tried so far: checked the imports, verified the installed packages with pip list, and filtered out the nulls.
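One check we could run to see whether the import fails on the executors rather than on the driver. The Spark wiring is shown only in comments because it needs a live session; find_spec probes whichever Python interpreter it runs in:

```python
import importlib.util

def geopy_available() -> bool:
    # True if the current interpreter can locate the geopy module
    return importlib.util.find_spec("geopy") is not None

# To probe the executors (assuming a running SparkSession named spark):
#   probe = F.udf(geopy_available, T.BooleanType())
#   spark.range(1).select(probe()).show()
# If this shows false while a plain import succeeds on the driver, geopy is
# missing from the worker Python environments, not from the driver's.
```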