
I am struggling to write faster code to group similar product names (column "prep") within the same "person_id" and the same "TNVED". A sample of my dataframe looks like this: sample_of_dataframe

So I built a dictionary keyed by IIN_BIN. Its values were dictionaries keyed by TNVED, and those in turn held dictionaries whose keys were group_ids, with the rows grouped by similarity using fuzzywuzzy.
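
The structure I was building looks roughly like this (a sketch with invented keys; the real keys come from the data):

dict_main = {
    'BIN_1': {                  # IIN_BIN
        'TNVED_A': {            # TNVED
            0: group_0_df,      # group_id -> DataFrame of rows with similar "prep"
            1: group_1_df,
        },
    },
}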

from fuzzywuzzy import fuzz
import numpy as np
import pandas as pd
import time
import warnings

warnings.filterwarnings("ignore")

length = len(np.unique(df['IIN_BIN'].to_list()))
t1 = time.time()
amount = 0
dict_main = {}
df['prep'] = df['prep'].fillna("")

for BIN in np.unique(df['IIN_BIN'].to_list()):
    temp_list_BIN = df[df['IIN_BIN'] == BIN]['TNVED']
    dict_pre_main = {}

    for tnved in np.unique(temp_list_BIN):
        dict_temp = {}
        j = 0
        # mask the slice with its own column (masking with df['TNVED'] only
        # works through index alignment and is what triggers the warnings)
        df_1_slice = df[df['IIN_BIN'] == BIN]
        df_1_slice = df_1_slice[df_1_slice['TNVED'] == tnved]
        df_1_slice.reset_index(drop=True, inplace=True)

        while len(df_1_slice) != 0:
            temp_list = []
            temp_del_list = []
            temp_fuzz_list = []
            temp_df = pd.DataFrame(columns=df_1_slice.columns)

            # compare every remaining row against the first one and collect
            # all rows whose "prep" is at least 90 similar
            for i in range(0, len(df_1_slice)):
                fuz_rate = fuzz.token_sort_ratio(
                    df_1_slice['prep'][0], df_1_slice['prep'][i])

                if fuz_rate >= 90:
                    temp_del_list.append(i)
                    temp_list.append([0, i, fuz_rate])
                    temp_fuzz_list.append(fuz_rate)
                    temp_df = temp_df.append(df_1_slice.loc[i])
                    dict_temp[j] = temp_df

            # remove the matched rows and move on to the next group
            df_1_slice.drop(temp_del_list, axis=0, inplace=True)
            df_1_slice.reset_index(drop=True, inplace=True)
            j += 1

        dict_pre_main[tnved] = dict_temp

    dict_main[BIN] = dict_pre_main
    time_gone = time.time() - t1

    if amount % 60 == 0:
        print('Percentage of BINs processed: ', 100 * amount / length,
              '%. Time elapsed since start: ', time_gone, ' s.')

    amount += 1

Maybe there is a faster way to do this? Afterwards I had to unpack all these dictionaries into one DataFrame, which took around 1-2 days for a 2-million-row dataframe:

t1 = time.time()
temp_list = list(df.columns)
temp_list.append('group_sorted')
concat_full = pd.DataFrame(columns=temp_list)
length = len(dict_main)
amount = 0

for key_iin in dict_main.keys():
    for key_tnved in dict_main[key_iin].keys():
        for key_group_number in dict_main[key_iin][key_tnved].keys():
            dict_main[key_iin][key_tnved][key_group_number]['group_sorted'] = key_group_number
            concat_full = concat_full.append(
                dict_main[key_iin][key_tnved][key_group_number])
        
    time_gone = time.time() - t1

    if amount % 60 == 0:
        print('Percentage of BINs processed: ', 100 * amount / length,
              '%. Time elapsed since start: ', time_gone, ' s.')

    amount += 1

concat_full.to_csv('item_desc_fuzzied.csv', index=False)

Maybe there is a faster way to do so?


1 Answer


You are using FuzzyWuzzy, and since you're suppressing warnings, I assume you are using the pure Python implementation. You should use fuzzywuzzy[speedup], or for even better performance RapidFuzz (I am the author). In RapidFuzz it can make sense to transform the fuzzy string matching from:

from fuzzywuzzy import fuzz

for i in range(0, len(df_1_slice)):
    fuz_rate = fuzz.token_sort_ratio(
        df_1_slice['prep'][0], df_1_slice['prep'][i])

    if fuz_rate >= 90:
        temp_del_list.append(i)
        temp_list.append([0, i, fuz_rate])
        temp_fuzz_list.append(fuz_rate)
        temp_df = temp_df.append(df_1_slice.loc[i])
        dict_temp[j] = temp_df

to something like

from rapidfuzz import process, fuzz

matches = process.extract(
    df_1_slice['prep'][0],
    df_1_slice['prep'],
    scorer=fuzz.token_sort_ratio,
    limit=None,
    score_cutoff=90)

for choice, score, index in matches:
    temp_del_list.append(index)
    temp_list.append([0, index, score])
    temp_fuzz_list.append(score)
    temp_df = temp_df.append(df_1_slice.loc[index])
    dict_temp[j] = temp_df

The usage of process.extract has the following advantages:

  • df_1_slice['prep'][0] is only preprocessed a single time
  • fewer function calls, since it can iterate over the Series in C++ (function calls in Python are slow)
  • the query can be preprocessed further, e.g. the words in the query only need to be sorted once
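
For illustration, here is a minimal, self-contained example of the tuples process.extract returns (the strings are invented and the exact scores depend on the inputs):

import pandas as pd
from rapidfuzz import process, fuzz

choices = pd.Series(["apple juice 1l", "juice apple 1l", "orange juice 1l"])

# each hit is a (choice, score, index) tuple; when the choices are a Series,
# the third element is the Series index, which the loop above drops rows by
matches = process.extract("apple juice 1l", choices,
                          scorer=fuzz.token_sort_ratio,
                          limit=None, score_cutoff=90)
print(matches)  # e.g. [('apple juice 1l', 100.0, 0), ('juice apple 1l', 100.0, 1)]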

Note that this implementation tries to keep the end result after the loop similar, and it is probably only part of the solution. Your usage of Pandas can probably be improved as well, but I am not very familiar with Pandas myself. One thing I noticed is that some of the variables never seem to be used, for example temp_fuzz_list. If they are really unused, it would make sense to remove them.
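
As one sketch of the Pandas side (untested against your data): DataFrame.append copies the whole frame on every call, which makes both of your loops quadratic in the number of rows. Collecting the pieces in a plain Python list and calling pd.concat once at the end avoids the repeated copies, e.g. for the unpacking step:

import pandas as pd

parts = []
for key_iin in dict_main:
    for key_tnved in dict_main[key_iin]:
        for key_group_number, group_df in dict_main[key_iin][key_tnved].items():
            group_df['group_sorted'] = key_group_number
            parts.append(group_df)

# one concatenation instead of one append per group
concat_full = pd.concat(parts, ignore_index=True)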

maxbachmann