I am trying to find "similar" rows in a dataframe based on the values in a particular column. For example, let's say we have this data:
+---+--------+
| id|   fruit|
+---+--------+
|  1|   apple|
|  2|    appl|
|  3|  banana|
|  4|     ora|
|  5|   banan|
|  6|bananana|
+---+--------+
Since the fruit values for ids 1 and 2 are similar, we group them together in a list. The same applies to the banana variants (ids 3, 5 and 6), while id 4 has no similar match and ends up in a group of its own. So, the final output that I want to get is [[1, 2], [3, 5, 6], [4]].
Note that this is only an example; in reality my data is very large, and I want to do the same thing using PySpark.
I tried using the ratio function from the fuzzywuzzy library, but I don't know how to do this without converting the PySpark dataframe to pandas and then running a for loop to find the similar words. Any help is highly appreciated.
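To make the expected grouping concrete, here is a rough plain-Python sketch of the logic I have in mind. It is not my actual code: `difflib.SequenceMatcher` from the standard library stands in for fuzzywuzzy's ratio, the 0.8 threshold is an arbitrary assumption, and `group_similar` is just a name made up for this illustration. I am also assuming that similarity is transitive for grouping purposes, i.e. if A matches B and B matches C, all three end up in the same group.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Sample data from the table above: (id, fruit)
rows = [(1, "apple"), (2, "appl"), (3, "banana"),
        (4, "ora"), (5, "banan"), (6, "bananana")]


def group_similar(rows, threshold=0.8):
    """Group ids whose strings are similar, merging groups transitively.

    threshold is an assumed cutoff on SequenceMatcher.ratio() (0.0 to 1.0).
    """
    # Union-find: every id starts out as its own group.
    parent = {row_id: row_id for row_id, _ in rows}

    def find(x):
        # Walk up to the group representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair of rows and merge the groups of similar ones.
    for (id_a, fruit_a), (id_b, fruit_b) in combinations(rows, 2):
        if SequenceMatcher(None, fruit_a, fruit_b).ratio() >= threshold:
            parent[find(id_a)] = find(id_b)

    # Collect the ids belonging to each representative.
    groups = {}
    for row_id, _ in rows:
        groups.setdefault(find(row_id), []).append(row_id)
    return sorted(sorted(group) for group in groups.values())


print(group_similar(rows))  # [[1, 2], [3, 5, 6], [4]]
```

This compares every pair of rows, which is fine for six rows but is exactly the part I cannot afford on the real data, hence the question about doing it in PySpark.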
Thanks!