I am trying to run the following code to generate an additional column in a PySpark DataFrame. The idea is to take a column from the DataFrame and get the maximum fuzzy-match score by comparing its value against a list of keywords (choices).
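For reference, process.extractOne returns a (match, score) tuple rather than a bare score, which is why the max() call below compares by score. A quick illustration of my understanding of fuzzywuzzy's behavior (the exact score may differ):

from fuzzywuzzy import fuzz, process

choices = ['hello', 'hello world', 'world hello']
# extractOne returns the best (match, score) pair, not just the score
print(process.extractOne('helo world', choices, scorer=fuzz.token_sort_ratio))
# e.g. ('hello world', 95)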
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def get_max_score(col):
    choices = ['hello', 'hello world', 'world hello']
    # process.extractOne returns a (match, score) tuple, so pick the best by score
    best = max(process.extractOne(col, choices, scorer=fuzz.token_sort_ratio),
               process.extractOne(col, choices, scorer=fuzz.token_set_ratio),
               process.extractOne(col, choices),
               key=lambda pair: pair[1])
    return str(best[1])  # the UDF is declared StringType, so return a string

get_max_udf = udf(get_max_score, StringType())
sdf_1 = sparkDf.withColumn('new_col', get_max_udf(sparkDf.col))
sdf_1.show()
The last statement, sdf_1.show(), gives me the following error:
Py4JJavaError: An error occurred while calling o1972.showString.
...
ModuleNotFoundError: No module named 'fuzzywuzzy'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
This is the first time I am working with Spark, so I don't fully understand how it executes UDFs. Please help. What functions can I use to perform the same fuzzy matching of the column value against choices = ['hello', 'hello world', 'world hello']?
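For example, I wondered whether Spark's built-in levenshtein function could approximate this without any third-party module. This is only a rough sketch of the idea (untested; note that levenshtein is a distance where lower is better, unlike fuzzywuzzy's 0-100 ratio):

from pyspark.sql import functions as F

choices = ['hello', 'hello world', 'world hello']
# edit distance from the column to each choice; F.least takes the row-wise minimum
distances = [F.levenshtein(F.col('col'), F.lit(c)) for c in choices]
sdf_2 = sparkDf.withColumn('min_distance', F.least(*distances))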
Also, the fuzzywuzzy package is installed on all the nodes, as far as I can tell; the check I ran is shown below.
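To verify that claim, I ran an import probe on the executors (a minimal sketch; it reports which Python interpreter each executor uses and whether it can import fuzzywuzzy):

import sys

def probe(rows):
    # runs on the executors, so it tests the worker Python, not the driver
    try:
        import fuzzywuzzy  # noqa: F401
        status = 'fuzzywuzzy OK'
    except ImportError:
        status = 'fuzzywuzzy MISSING'
    return [(sys.executable, status)]

print(sparkDf.rdd.mapPartitions(probe).distinct().collect())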