I want to aggregate a dataset based on string field and how similar values are. For example, the following table will show some values:
ID | Name |
---|---|
0231 | Ebrahim Talaq |
45621 | Ebrahm Talaq |
32134 | Ebrahim Talaq L.L.C |
5431234 | Martin Cole |
The end result of the grouping should be like the following:
Name | Count |
---|---|
a value for Ebrahim Talaq, Ebrahm Talaq, Ebrahim Talaq L.L.C | 3 |
Martin Cole | 1 |
I could do a process where I do a join with the same table and calculate the distances between the strings (fuzzy match or Jaro-distance as examples) but that will be multi-step process.
Anyone have a better idea?