
I am trying to run the following code to generate an additional column in a PySpark DataFrame. The idea is to take a string column from the DataFrame and, for each value, get the maximum fuzzy-match score against a list of keywords (choices).

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def get_max_score(col):
    choices = ['hello', 'hello world', 'world hello']
    # extractOne returns a (match, score) tuple, so compare the scores
    # rather than the tuples; str() matches the declared StringType
    return str(max(
        process.extractOne(col, choices, scorer=fuzz.token_sort_ratio)[1],
        process.extractOne(col, choices, scorer=fuzz.token_set_ratio)[1],
        process.extractOne(col, choices)[1],
    ))

get_max_udf = udf(get_max_score, StringType())
sdf_1 = sparkDf.withColumn('new_col', get_max_udf(sparkDf.col))
sdf_1.show()

The last statement, sdf_1.show(), gives me an error:

Py4JJavaError: An error occurred while calling o1972.showString.
...
ModuleNotFoundError: No module named 'fuzzywuzzy'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

This is the first time I am working with Spark, so I am not sure how it all fits together. Please help. What functions can I use to do the same thing, i.e. fuzzy-match the column value against choices = ['hello','hello world','world hello']? Also, the fuzzywuzzy package is installed on all the nodes.
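
For reference, a minimal UDF-free sketch using Spark's built-in levenshtein function. It computes edit distance (lower is better), so it is not on the same 0-100 scale as fuzzywuzzy's ratios; sparkDf and the column col are assumed from the snippet above:

from pyspark.sql import functions as F

choices = ['hello', 'hello world', 'world hello']

# Edit distance from the column value to each choice; the closest
# match is the one with the smallest distance.
distances = [F.levenshtein(F.col('col'), F.lit(c)) for c in choices]
sdf_alt = sparkDf.withColumn('min_distance', F.least(*distances))
sdf_alt.show()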

  • You sure you have the module `fuzzywuzzy` installed? – Preetham Jan 13 '20 at 06:33
  • More specifically, to complete Preetham's comment, do you have fuzzywuzzy installed on each and every node? – Steven Jan 13 '20 at 14:25
  • @Steven I checked and it is installed on all the nodes. – trougc Jan 13 '20 at 15:28
  • And you only have one Python installed? No virtual env? The Python you are checking on every node is the same one that Spark is using? – Steven Jan 13 '20 at 15:57
  • @Steven ya. I have other virtual envs. I am checking on the same env for all the nodes and the package is installed on all of them. – trougc Jan 13 '20 at 17:02
  • @Steven Thanks. There are some other issues with the nodes I am working on. But I tried it on my personal computer and everything worked fine. I guess the issue is with the nodes or the env. – trougc Jan 15 '20 at 04:49
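
A minimal sketch of one way to check the point raised in the comments, i.e. which interpreter the executors actually run (assumes the sparkDf from the question):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def executor_python(_):
    # This runs on the executors, so it reports their interpreter path,
    # which may differ from the driver's.
    import sys
    return sys.executable

which_python = udf(executor_python, StringType())
sparkDf.select(which_python(sparkDf.col).alias('executor_python')).distinct().show(truncate=False)

If the printed path is not the environment where fuzzywuzzy was installed, that mismatch would explain the ModuleNotFoundError.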

1 Answer


Install fuzzywuzzy using the command below:

pip install fuzzywuzzy
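
On a cluster, this must be run with the interpreter the executors use on every node, not just the driver. One common approach, sketched here with placeholder paths, is to point Spark at the environment where fuzzywuzzy is installed before the session is created (whether the executors pick these up depends on your deploy mode):

import os

# Placeholder path: the environment where fuzzywuzzy is installed
os.environ['PYSPARK_PYTHON'] = '/path/to/env/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/path/to/env/bin/python'

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()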