
I want to use fuzz.ratio on a DataFrame, but I'm working in PySpark (I can't use pandas).

I import the function:

from fuzzywuzzy import fuzz

I create a data frame like this:

communes_corrompues = spark.createDataFrame(
    [
        ("VILLEAINTE", "VILLEPINTE"),
        ("QILLEPINTE", "VILLEPINTE"),
        ("AHIENS", "AMIENS"),
        ("AMIEPS", "AMIENS"),
        ("CVRGY", "CERGY"),
        ("CERGA", "CERGY"),
    ],
    ["corrompue", "resultat"],
)

And this statement doesn't work:

communes_corrompues_ratio = communes_corrompues.withColumn("fuzzywuzzy_ratio",
lit(fuzz.ratio(col("resultat"),col("corrompue"))))

I get this error:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Can someone help me, or does anyone know how to do this?

James Westgate
Neoooar

1 Answer


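I'd try a user-defined function (UDF) for that. fuzz.ratio expects plain Python strings, whereas col(...) returns a Column expression, which is why the direct call fails. Something like: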

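from pyspark.sql.functions import col, udf
from fuzzywuzzy import fuzz

# Wrap fuzz.ratio in a UDF so it receives each row's string values
@udf("int")
def fuzz_udf(a, b):
    return fuzz.ratio(a, b)

communes_corrompues_ratio = communes_corrompues.withColumn("fuzzywuzzy_ratio", fuzz_udf(col("resultat"), col("corrompue")))
communes_corrompues_ratio.show()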
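
To sanity-check the kind of scores the UDF should produce without starting Spark, the same comparison can be run on plain Python strings. This is only a minimal sketch: it assumes the standard-library difflib as a stand-in for fuzz.ratio (as far as I know, fuzzywuzzy falls back to difflib's SequenceMatcher when python-Levenshtein is not installed, so the real scores may differ slightly otherwise). The helper name approx_ratio is hypothetical, and returning 0 for missing values is an assumption.

```python
from difflib import SequenceMatcher

# Sample pairs from the question: (corrompue, resultat)
PAIRS = [
    ("VILLEAINTE", "VILLEPINTE"),
    ("QILLEPINTE", "VILLEPINTE"),
    ("AHIENS", "AMIENS"),
    ("AMIEPS", "AMIENS"),
    ("CVRGY", "CERGY"),
    ("CERGA", "CERGY"),
]


def approx_ratio(a, b):
    """Hypothetical stand-in for fuzz.ratio: similarity as an int from 0 to 100."""
    if a is None or b is None:
        return 0  # assumption: treat a missing value as "no match"
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# Score every pair, the way the UDF would score every row
scores = {pair: approx_ratio(*pair) for pair in PAIRS}

for (corrompue, resultat), score in scores.items():
    print(corrompue, resultat, score)
```

A plain function like this can also be wrapped after the fact with udf(approx_ratio, "int") instead of using the decorator form.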
matkurek