
Here is the sample dataframe:

111853                         \t Authentic Restaurant
108660                                      \tBone Jam
57176                            \tBurgers and Barrels
77583     \tDelice de France @ Bonne Bouche - Kingston
39702                            \tHarlington Tandoori
                              ...
104056               食全超市 Fodal (Mile End) Supermarket
43244                           食全超市 FODAL Supermarket
38112                           食全超市 FODAL Supermarket
104045                               香辣居 Chilli Legend
104144          香辣居 Chilli Legend - Express - Shadwell

I am using the rapidfuzz library to find string similarities.

import numpy as np
from rapidfuzz import process, fuzz
similar_store_matrix = process.cdist(data_df['store_name'], data_df['store_name'], workers=-1, scorer=fuzz.token_set_ratio, score_cutoff=85)
similar_store_score_matrix = np.argwhere(similar_store_matrix >= 85)

The code above scores every store_name against every other store_name in the dataframe and then keeps the index pairs whose similarity is at least 85%.
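
For reference, here is a minimal sketch of my own (not part of the original question) that maps the index pairs produced by np.argwhere back to store names, assuming the full matrix fits in memory:

# Sketch only: keep i < j to drop self-matches and mirrored (j, i) duplicates.
pairs = similar_store_score_matrix[similar_store_score_matrix[:, 0] < similar_store_score_matrix[:, 1]]
for i, j in pairs:
    print(data_df['store_name'].iloc[i], '<->', data_df['store_name'].iloc[j])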

If the dataframe has more than 100,000 rows, I get this error:

  File "cpp_process_cdist.pyx", line 349, in cpp_process_cdist.cdist
  File "cpp_process_cdist.pyx", line 279, in cpp_process_cdist.cdist_single_list
  File "cpp_process_cdist.pyx", line 229, in cpp_process_cdist.cdist_single_list_similarity
numpy.core._exceptions.MemoryError: Unable to allocate 23.7 GiB for an array with shape (159414, 159414) and data type uint8

I would really like to chunk the dataframe; however, I need to compare every store_name against all the other store_names to check for similarities, so I believe chunking is out of the question.

How will I be able to overcome this?

  • The easiest option is to buy more memory. The second easiest is to rework your program to not require everything to be in memory at once. You could chunk the dataframe on disk and load each chunk to compare with one another. – AKX Jan 05 '22 at 12:19
  • Why not try `process.cdist(data_df['store_name'], data_df.loc[i:i+1000, 'store_name'], ...)` in a loop, and on every iteration check for similarities and filter out the important ones (> 85%)? Then repeat for the next 1000 items. – 9769953 Jan 05 '22 at 12:25 (see the sketch after these comments)
  • As noted by others, you will need to process the data in smaller chunks. cdist has to create a matrix of size `len(queries) * len(choices) * sizeof(numpy datatype)` to store results. You can specify a different numpy datatype using the dtype argument; however, uint8 is already the smallest one possible. – maxbachmann Jan 05 '22 at 23:46
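
A minimal sketch (my own illustration, not part of the thread) of the chunked approach suggested in the comments above; chunk_size is an arbitrary value to tune against the available memory:

import numpy as np
from rapidfuzz import process, fuzz

names = data_df['store_name'].tolist()
chunk_size = 1000   # arbitrary; each iteration allocates roughly chunk_size * len(names) bytes for the uint8 result
matches = []        # (row_index, column_index) pairs with similarity >= 85

for start in range(0, len(names), chunk_size):
    chunk = names[start:start + chunk_size]
    # len(chunk) x len(names) result instead of the full len(names) x len(names) matrix
    scores = process.cdist(chunk, names, scorer=fuzz.token_set_ratio, score_cutoff=85, workers=-1)
    rows, cols = np.nonzero(scores >= 85)
    matches.extend(zip(rows + start, cols))   # shift chunk-local row indices back to global positions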

0 Answers