Here is the sample dataframe:
111853  Authentic Restaurant
108660  Bone Jam
57176   Burgers and Barrels
77583   Delice de France @ Bonne Bouche - Kingston
39702   Harlington Tandoori
...
104056 食全超市 Fodal (Mile End) Supermarket
43244 食全超市 FODAL Supermarket
38112 食全超市 FODAL Supermarket
104045 香辣居 Chilli Legend
104144 香辣居 Chilli Legend - Express - Shadwell
I am using the rapidfuzz library to compute string similarities:
import numpy as np
from rapidfuzz import fuzz, process

similar_store_matrix = process.cdist(data_df['store_name'], data_df['store_name'],
                                     workers=-1, scorer=fuzz.token_set_ratio,
                                     score_cutoff=85)
similar_store_score_matrix = np.argwhere(similar_store_matrix >= 85)
The code above scores each store_name against every other store_name in the dataframe and keeps the pairs whose similarity is at least 85%.
When the dataframe has more than 100,000 rows, I get this error:
File "cpp_process_cdist.pyx", line 349, in cpp_process_cdist.cdist
File "cpp_process_cdist.pyx", line 279, in cpp_process_cdist.cdist_single_list
File "cpp_process_cdist.pyx", line 229, in cpp_process_cdist.cdist_single_list_similarity
numpy.core._exceptions.MemoryError: Unable to allocate 23.7 GiB for an array with shape (159414, 159414) and data type uint8
I would like to process the dataframe in chunks, but since every store_name has to be compared against all the other store_names, I assumed chunking was out of the question.
How can I overcome this?