
I am making a table in which the indexes of similar products are recorded. Data from those rows is then copied into a single row that agglomerates all the data, after which the duplicate row should be deleted. The code is below:

import pandas as pd
from fuzzywuzzy import fuzz  # token_sort_ratio; assumes an open DB connection `conn` below

matchedproducts_df = pd.read_sql_query("SELECT * from matchedproducts", conn)

print(len(matchedproducts_df.index))

def mergeduplicates(df, similarity_field='', databasetable='', similar_level=85):
    
    print(len(df.index))

    def check_simi(d):
        global dupl_indexes, dupl_originals
        dupl_originals = []
        dupl_indexes = []
        for i in range(len(d.values) - 1):
            for j in range(i + 1, len(d.values)):
                if fuzz.token_sort_ratio(d.values[i], d.values[j]) >= similar_level:
                    dupl_indexes.append(d.index[j])
                    dupl_originals.append(d.index[i])


    indexes = df.groupby([True]*len(df))[similarity_field].apply(check_simi)

    a = 0
    for i in dupl_indexes:
        if df.iloc[i, 5] == 'harveynichols':
            df.at[dupl_originals[a], 'pricehn'] = df.loc[i, 'price']
            df.at[dupl_originals[a], 'availabilityhn'] = df.loc[i, 'availability']
            df.at[dupl_originals[a], 'storehn'] = df.loc[i, 'store']
            df.at[dupl_originals[a], 'hyperlinkhn'] = df.loc[i, 'hyperlink']
            df.drop([i])
        elif df.iloc[i, 5] == 'houseoffraser':
            df.at[dupl_originals[a], 'pricehof'] = df.loc[i, 'price']
            df.at[dupl_originals[a], 'availabilityhof'] = df.loc[i, 'availability']
            df.at[dupl_originals[a], 'storehof'] = df.loc[i, 'store']
            df.at[dupl_originals[a], 'hyperlinkhof'] = df.loc[i, 'hyperlink']
            df.drop([i])
        elif df.iloc[i, 5] == 'selfridges':
            df.at[dupl_originals[a], 'pricesf'] = df.loc[i, 'price']
            df.at[dupl_originals[a], 'availabilitysf'] = df.loc[i, 'availability']
            df.at[dupl_originals[a], 'storesf'] = df.loc[i, 'store']
            df.at[dupl_originals[a], 'hyperlinksf'] = df.loc[i, 'hyperlink']
            df.drop([i])
        elif df.iloc[i, 5] == 'lookfantastic':
            df.at[dupl_originals[a], 'pricelf'] = df.loc[i, 'price']
            df.at[dupl_originals[a], 'availabilitylf'] =  df.loc[i, 'availability']
            df.at[dupl_originals[a], 'storelf'] = df.loc[i, 'store']
            df.at[dupl_originals[a], 'hyperlinklf'] = df.loc[i, 'hyperlink']
            df.drop([i])
        elif df.iloc[i, 5] == 'superdrug':
            df.at[dupl_originals[a], 'pricesd'] = df.loc[i, 'price']
            df.at[dupl_originals[a], 'availabilitysd'] = df.loc[i, 'availability']
            df.at[dupl_originals[a], 'storesd'] = df.loc[i, 'store']
            df.at[dupl_originals[a], 'hyperlinksd'] = df.loc[i, 'hyperlink']
            df.drop([i])
        elif df.iloc[i, 5] == 'boots':
            df.at[dupl_originals[a], 'priceboots'] = df.loc[i, 'price']
            df.at[dupl_originals[a], 'availabilityboots'] = df.loc[i, 'availability']
            df.at[dupl_originals[a], 'storeboots'] = df.loc[i, 'store']
            df.at[dupl_originals[a], 'hyperlinkboots'] = df.loc[i, 'hyperlink']
            df.drop([i])
        elif df.iloc[i, 5] == 'allbeauty':
            df.at[dupl_originals[a], 'priceab'] = df.loc[i, 'price']
            df.at[dupl_originals[a], 'availabilityab'] = df.loc[i, 'availability']
            df.at[dupl_originals[a], 'storeab'] = df.loc[i, 'store']
            df.at[dupl_originals[a], 'hyperlinkab'] = df.loc[i, 'hyperlink']
            df.drop([i])
        elif df.iloc[i, 5] == 'asos':
            df.at[dupl_originals[a], 'pricea'] = df.loc[i, 'price']
            df.at[dupl_originals[a], 'availabilitya'] = df.loc[i, 'availability']
            df.at[dupl_originals[a], 'storea'] = df.loc[i, 'store']
            df.at[dupl_originals[a], 'hyperlinka'] = df.loc[i, 'hyperlink']
            df.drop([i])

        # for index_list in indexes:
        #     df.drop(index_list, inplace=True)

        a += 1

    print(len(df.index))

    df.to_csv('C:/Users/Judoo/Desktop/matchedproducts.csv')

    df.to_sql(databasetable, conn, if_exists="replace")
    
    conn.commit()

    conn.close()

mergeduplicates(matchedproducts_df, similarity_field='name', databasetable='matchedproducts')

The code works well for finding duplicates and copying their data into the relevant original row, but it is not dropping/deleting the duplicate row after the check: the length of the table is the same before and after the script runs. I am unsure why this is happening; any help would be appreciated.

Judoo123
  • I think you would need to pass inplace=True, try pls.`df.drop([i],inplace=True)` – simpleApp May 01 '21 at 22:55
  • @simpleApp `inplace=True` is not recommended to be used, and I believe will be deprecated in a future version of pandas. but you are correct in essence, just re-assign your `df.drop` to your dataframe, `df = df.drop(..)` – Umar.H May 01 '21 at 23:09
  • I have implemented this but it now gives a ``` KeyError: 460```. Is there any reason this would happen? – Judoo123 May 02 '21 at 00:11
  • At this line specifically ```df.at[dupl_originals[a], 'priceboots'] = df.loc[i, 'price']``` – Judoo123 May 02 '21 at 00:15

1 Answer


`df.drop()` returns a new DataFrame with the given rows removed; it does not modify the original DataFrame. You either have to pass `inplace=True`, or reassign the result to your variable, i.e. `df = df.drop([i])`. See here
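A minimal sketch with made-up data showing the difference (names here are hypothetical, not from the question's table):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c']})

# The returned DataFrame is discarded; df itself is unchanged.
df.drop([1])
print(len(df))  # 3

# Reassigning keeps the drop.
df = df.drop([1])
print(len(df))  # 2
```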

If these do not fix your issue, you will have to look more closely at how `df.drop([i])` interacts with the rest of your loop.
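Regarding the `KeyError` mentioned in the comments: if you reassign `df = df.drop([i])` inside the loop, a later iteration can try to look up a label that was already removed (e.g. an index in `dupl_originals` that was itself dropped earlier). A common pattern is to collect the labels during the loop and drop them all once afterwards. A sketch with made-up labels (the copying logic is elided):

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 20, 30, 40]},
                  index=[100, 200, 300, 400])

# Collect labels to remove, drop once after the loop, so no
# iteration ever looks up an already-dropped label.
to_drop = []
for label in [200, 400]:
    # ... copy this row's values into the surviving row here ...
    to_drop.append(label)

df = df.drop(to_drop)
print(list(df.index))  # [100, 300]
```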

YlmRdm
  • This is not how you really answer here. Use their code and then modify accordingly then explain to them what is the main issue here. – Bhavyadeep Yadav May 02 '21 at 09:14