
I have to apply the Levenshtein function on the `last` column when `passport` and `country` are the same.

from pyspark.sql import functions as f
from pyspark.sql.functions import levenshtein

# Pair every row with every other row that shares the same passport and country.
matrix = passport_heck.select(
        f.col('name_id').alias('name_id_1'),
        f.col('last').alias('last_1'),
        f.col('country').alias('country_1'),
        f.col('passport').alias('passport_1')) \
    .crossJoin(passport_heck.select(
        f.col('name_id').alias('name_id_2'),
        f.col('last').alias('last_2'),
        f.col('country').alias('country_2'),
        f.col('passport').alias('passport_2'))) \
    .filter((f.col('passport_1') == f.col('passport_2')) & (f.col('country_1') == f.col('country_2')))

# Levenshtein distance between the two last names of each pair.
res = matrix.withColumn('distance', levenshtein(f.col('last_1'), f.col('last_2')))

Now I am getting the following output, which is fine.

[screenshot of the output table]

Now I need to delete the duplicate pairs (for example, ID 558635 with 1106562 and then 1106562 with 558635 compare the same content).

Can anyone please suggest some logic in pyspark to get the table below?

[screenshot of the desired deduplicated table]
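For reference, one standard trick for this kind of symmetric duplicate is to canonicalize each pair so both orderings look the same, then drop duplicates. A minimal sketch (not the poster's code), assuming `name_id` uniquely identifies a row:

# Put the smaller id first, then drop duplicates, so (558635, 1106562)
# and (1106562, 558635) collapse to a single row; the inequality filter
# also removes self-pairs.
pairs = res \
    .withColumn('id_lo', f.least(f.col('name_id_1'), f.col('name_id_2'))) \
    .withColumn('id_hi', f.greatest(f.col('name_id_1'), f.col('name_id_2'))) \
    .filter(f.col('id_lo') != f.col('id_hi')) \
    .dropDuplicates(['id_lo', 'id_hi'])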

aamirmalik124
  • @MorrisdeOryx, can you please have a look at the question and guide me on how to solve this with pg_trgm? Reference: [link](https://stackoverflow.com/questions/58840005/fuzzy-matching-a-string-in-sql) – aamirmalik124 Mar 05 '20 at 10:12
  • First of all you should apply [trim](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.trim) and [lower](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lower) before the groupby to further reduce the number of rows. After that you can do a cross join (which might break your instance) to apply levenshtein and soundex (see the sketch below). – cronoik Mar 06 '20 at 09:48
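To make the previous comment concrete, here is a minimal sketch of that normalization step, using the question's `passport_heck` dataframe and column names; `dropDuplicates` stands in for the suggested groupby:

from pyspark.sql import functions as f

# Normalize the text columns first so values differing only in case or
# surrounding whitespace compare equal.
normalized = passport_heck \
    .withColumn('last', f.lower(f.trim(f.col('last')))) \
    .withColumn('passport', f.trim(f.col('passport'))) \
    .withColumn('country', f.lower(f.trim(f.col('country'))))

# Collapse exact duplicates before the expensive cross join.
deduped = normalized.dropDuplicates(['last', 'passport', 'country'])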

1 Answer


Your problem can become quite complicated if you want to get it right, but here is some sample code in pyspark that hopefully gets you started.

First, a tiny dataset:

tinydata = sqlContext.createDataFrame(
    [
        (3527524, 'aamir', 'al malik', 'aamir.almalik@gmail.com'),
        (4287983, 'aamir', 'al malik', 'aamir.almalik@company.com'),
        (200490, 'aamir', 'al malik', 'aamir.almalik@gmail.come'),
        (1906639, 'tahir', 'al malik', 'tahir.almalik@gmail.com')
    ],
    ['ID', 'first_NAME', 'last_NAME', 'EMAIL']
)

Then you convert it to a matrix of differences through a cross join. Note that if you have 5 million rows, this will become huge. You need to avoid comparisons as much as possible, for example by following some of the comments on your question, plus any other ideas you may come up with. Note that the final filter is there to avoid comparing two rows twice.

from pyspark.sql import functions as F

# Pair every row with every other row; ID1 > ID2 keeps each unordered pair once.
matrix = tinydata.select(F.col('ID').alias('ID1'), F.col('EMAIL').alias('EMAIL1')) \
    .crossJoin(tinydata.select(F.col('ID').alias('ID2'), F.col('EMAIL').alias('EMAIL2'))) \
    .filter(F.col('ID1') > F.col('ID2'))
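As an illustration of cutting down comparisons (a sketch, not part of the original answer), one could block on a cheap key such as soundex of the last name, so that only rows in the same block get paired:

# Blocking variant: pair only rows whose last names share a soundex code,
# instead of a full cross join.
blocked = tinydata.select(F.col('ID').alias('ID1'),
                          F.col('EMAIL').alias('EMAIL1'),
                          F.soundex('last_NAME').alias('block1')) \
    .join(tinydata.select(F.col('ID').alias('ID2'),
                          F.col('EMAIL').alias('EMAIL2'),
                          F.soundex('last_NAME').alias('block2')),
          F.col('block1') == F.col('block2')) \
    .filter(F.col('ID1') > F.col('ID2'))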

After that, you can calculate distances.

import Levenshtein  # python-Levenshtein package

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Wrap the plain Python distance function as a Spark UDF returning an integer.
def lev_dist(left, right):
    return Levenshtein.distance(left, right)

lev_dist_udf = udf(lev_dist, IntegerType())

res = matrix.withColumn('d', lev_dist_udf(F.col('EMAIL1'), F.col('EMAIL2')))

With the tiny example you get

res.show()
+-------+--------------------+-------+--------------------+---+
|    ID1|              EMAIL1|    ID2|              EMAIL2|  d|
+-------+--------------------+-------+--------------------+---+
|3527524|aamir.almalik@gma...| 200490|aamir.almalik@gma...|  1|
|3527524|aamir.almalik@gma...|1906639|tahir.almalik@gma...|  2|
|4287983|aamir.almalik@com...|3527524|aamir.almalik@gma...|  5|
|4287983|aamir.almalik@com...| 200490|aamir.almalik@gma...|  6|
|4287983|aamir.almalik@com...|1906639|tahir.almalik@gma...|  7|
|1906639|tahir.almalik@gma...| 200490|aamir.almalik@gma...|  3|
+-------+--------------------+-------+--------------------+---+

Thanks to @cronoik for pointing this out.

There is no need for a udf; it should be something like this:

from pyspark.sql.functions import levenshtein

matrix = tinydata.select(F.col('ID').alias('ID1'), F.col('EMAIL').alias('EMAIL1')) \
    .crossJoin(tinydata.select(F.col('ID').alias('ID2'), F.col('EMAIL').alias('EMAIL2'))) \
    .filter(F.col('ID1') > F.col('ID2'))

res = matrix.withColumn('d', levenshtein(F.col('EMAIL1'), F.col('EMAIL2')))
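Mapped back to the question's columns (assuming `name_id` is unique per row), the same ordered-ID filter would remove the mirrored pairs:

# Strict '<' keeps exactly one of each mirrored pair and drops self-comparisons.
res = matrix.filter(f.col('name_id_1') < f.col('name_id_2')) \
    .withColumn('distance', levenshtein(f.col('last_1'), f.col('last_2')))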
lrnzcig
  • @lrnzcig Thank you, I will try this and let you know in case of any doubt. – aamirmalik124 Mar 07 '20 at 18:41
  • You don't need a udf for [levenshtein](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=leven#pyspark.sql.functions.levenshtein) – cronoik Mar 07 '20 at 20:59
  • @cronoik then how can we do it? Can you please let me know? – aamirmalik124 Mar 08 '20 at 05:42
  • I used it like `res = matrix.withColumn('distance', levenshtein(f.col('full_name_1'), f.col('full_name_2')))` – aamirmalik124 Mar 08 '20 at 13:25
  • @aamirmalik124: Have you checked the [documentation](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.soundex)? – cronoik Mar 09 '20 at 02:52
  • I have edited the question; can you please check it? – aamirmalik124 Mar 10 '20 at 10:33
  • @aamirmalik124 please don't do that: don't edit the question, leave it as a reference. After your edit, my answer does not make sense anymore. If you have a new question, just open it. – lrnzcig Mar 10 '20 at 20:45
  • @aamirmalik124 Also, please note that this is not a software development service. Please add code showing whatever you have tried to solve the problem, and try to narrow your question down to something more concrete that you don't know how to solve. Thanks. – lrnzcig Mar 10 '20 at 20:46