Find similar rows of a pyspark dataframe based on a particular column using fuzzywuzzy library

Asked Aug 16 '23 at 18:39

Active Aug 16 '23 at 19:17

Viewed 21 times

I am trying to find "similar" rows in a dataframe based on a particular column. For example, suppose we have this data:

+---+--------+
| id|   fruit|
+---+--------+
|  1|   apple|
|  2|    appl|
|  3|  banana|
|  4|     ora|
|  5|   banan|
|  6|bananana|
+---+--------+

Since the fruit values for ids 1 and 2 are similar, they are grouped together in one list. The same applies to the banana variants (ids 3, 5, and 6), while id 4 has no similar row and stays on its own. So the final output I want is [[1, 2], [3, 5, 6], [4]].
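To make the grouping rule concrete, here is a minimal plain-Python sketch of what I mean. It rests on several assumptions of mine: `difflib.SequenceMatcher` from the standard library stands in for fuzzywuzzy's ratio, the threshold of 80 is arbitrary, similarity is treated as transitive (rows are grouped if they are linked through a chain of similar rows), and the names `similarity` and `group_similar` are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Return a 0-100 score (stand-in for fuzzywuzzy's ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def group_similar(rows, threshold=80):
    """Group ids whose text is linked by a chain of similar pairs."""
    # Union-find: every id starts as its own group.
    parent = {row_id: row_id for row_id, _ in rows}

    def find(x):
        # Walk up to the group's representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every unordered pair once and merge groups that are similar enough.
    for (id_a, text_a), (id_b, text_b) in combinations(rows, 2):
        if similarity(text_a, text_b) >= threshold:
            parent[find(id_a)] = find(id_b)

    # Collect the ids under each representative.
    groups = {}
    for row_id, _ in rows:
        groups.setdefault(find(row_id), []).append(row_id)
    return sorted(sorted(g) for g in groups.values())


rows = [(1, "apple"), (2, "appl"), (3, "banana"),
        (4, "ora"), (5, "banan"), (6, "bananana")]
print(group_similar(rows))  # [[1, 2], [3, 5, 6], [4]]
```

With this threshold, "banan" and "bananana" are not similar enough to each other directly, but both are similar to "banana", which is why I think of the groups as connected components rather than strict all-pairs matches.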

Note that this is just an example; in reality my data is very large, and I want to do the same thing using PySpark.

I tried using the ratio function from the fuzzywuzzy library, but I don't know how to do this without converting the PySpark dataframe to pandas and then running a for loop to find the similar words. Any help is highly appreciated.
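One direction that might avoid pandas is a self-join with a scoring UDF, so the pairwise comparison stays inside Spark. The following is only a rough sketch under my own assumptions: the dataframe has the columns `id` and `fruit` as in the example, `difflib.SequenceMatcher` again stands in for fuzzywuzzy's ratio (that function could be swapped in if the library is installed on the executors), the threshold of 80 is arbitrary, and `similarity` and `similar_pairs` are hypothetical names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 score (stand-in for fuzzywuzzy's ratio); nulls score 0."""
    if a is None or b is None:
        return 0
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similar_pairs(df, threshold=80):
    """Return a DataFrame of (src, dst) id pairs whose fruit values look alike.

    Assumes df has the columns `id` and `fruit`.
    """
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    score = F.udf(similarity, IntegerType())
    left, right = df.alias("a"), df.alias("b")
    return (
        # a.id < b.id keeps each unordered pair exactly once.
        left.join(right, F.col("a.id") < F.col("b.id"))
        .where(score(F.col("a.fruit"), F.col("b.fruit")) >= threshold)
        .select(F.col("a.id").alias("src"), F.col("b.id").alias("dst"))
    )
```

Calling `similar_pairs(df)` would give the edges between similar rows; the groups are then the connected components of those edges, and ids that appear in no pair form groups of one. This join compares every pair of rows, so on very large data it would probably need some pre-filtering (for example, only comparing rows that share a first letter), and I have not tested how well it scales.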

Thanks!

edited Aug 16 '23 at 18:52

DonkeyKong
