
I am trying to run the following code to generate an additional column in a PySpark DataFrame. The idea is to take a string column from the DataFrame and, for each value, get the maximum fuzzy-match score against a list of keywords (choices).

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def get_max_score(col):
    choices = ['hello', 'hello world', 'world hello']
    # extractOne returns a (match, score) tuple, so compare the scores
    # rather than the tuples; str() matches the declared StringType
    return str(max(
        process.extractOne(col, choices, scorer=fuzz.token_sort_ratio)[1],
        process.extractOne(col, choices, scorer=fuzz.token_set_ratio)[1],
        process.extractOne(col, choices)[1],
    ))

get_max_udf = udf(get_max_score, StringType())
sdf_1 = sparkDf.withColumn('new_col', get_max_udf(sparkDf.col))
sdf_1.show()

The last statement, sdf_1.show(), gives me an error:

Py4JJavaError: An error occurred while calling o1972.showString.
...
ModuleNotFoundError: No module named 'fuzzywuzzy'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

This is the first time I am working with Spark, so I am not sure how it all fits together. Please help. What functions can I use to do the same thing, i.e. fuzzy-match the column value against choices = ['hello','hello world','world hello']? Also, the fuzzywuzzy package is installed on all the nodes.
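
For reference, a minimal UDF-free sketch using Spark's built-in levenshtein function. It computes edit distance (lower is better), so it is not on the same 0-100 scale as fuzzywuzzy's ratios; sparkDf and the column col are assumed from the snippet above:

from pyspark.sql import functions as F

choices = ['hello', 'hello world', 'world hello']

# Edit distance from the column value to each choice; the closest
# match is the one with the smallest distance.
distances = [F.levenshtein(F.col('col'), F.lit(c)) for c in choices]
sdf_alt = sparkDf.withColumn('min_distance', F.least(*distances))
sdf_alt.show()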

  • You sure you have the module `fuzzywuzzy` installed? – Preetham Jan 13 '20 at 06:33
  • More specifically, to complete Preetham's comment, do you have fuzzywuzzy installed on each and every node? – Steven Jan 13 '20 at 14:25
  • @Steven I checked and it is installed on all the nodes. – trougc Jan 13 '20 at 15:28
  • And you only have one Python installed? No virtual env? The Python you are checking on every node is the same one that Spark is using? – Steven Jan 13 '20 at 15:57
  • @Steven ya. I have other virtual envs. I am checking on the same env for all the nodes and the package is installed on all of them. – trougc Jan 13 '20 at 17:02
  • @Steven Thanks. There are some other issues with the nodes I am working on. But I tried it on my personal computer and everything worked fine. I guess the issue is with the nodes or the env. – trougc Jan 15 '20 at 04:49
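
A minimal sketch of one way to check the point raised in the comments, i.e. which interpreter the executors actually run (assumes the sparkDf from the question):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def executor_python(_):
    # This runs on the executors, so it reports their interpreter path,
    # which may differ from the driver's.
    import sys
    return sys.executable

which_python = udf(executor_python, StringType())
sparkDf.select(which_python(sparkDf.col).alias('executor_python')).distinct().show(truncate=False)

If the printed path is not the environment where fuzzywuzzy was installed, that mismatch would explain the ModuleNotFoundError.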

1 Answer


Install fuzzywuzzy using the command below:

pip install fuzzywuzzy
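
On a cluster, this must be run with the interpreter the executors use on every node, not just the driver. One common approach, sketched here with placeholder paths, is to point Spark at the environment where fuzzywuzzy is installed before the session is created (whether the executors pick these up depends on your deploy mode):

import os

# Placeholder path: the environment where fuzzywuzzy is installed
os.environ['PYSPARK_PYTHON'] = '/path/to/env/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/path/to/env/bin/python'

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()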