0

I am trying to clean a column using fuzzywuzzy using the following code:

import pyodbc 
from fuzzywuzzy import fuzz

# Getting sql list
conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=USER-PC\SQLEXPRESS;'
                      'Database=AdventureWorks2014;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()
cursor.execute("SELECT top 100 Publisher FROM dbo.Publishers where publisher like '%Ebate%'")

Product_list = cursor.fetchall()

# Groups of names with Levenshtein distance > 77

Groups = list() 
for name in Product_list:
    for grp in Groups:
        if all(fuzz.ratio(name, wd) > 78 for wd in grp):
            grp.append(name)
            break
    else:
        Groups.append([name, ])

print('output Groups:', Groups)

outputs:

output Groups: [[('Ebates.com', ), ('Ebates', ), ('Ebates Inc', ), ('Ebates.com', ), ('Ebates', ), ('Ebates', ), ('Ebates', ), ('Ebates', ), ('Ebates.ca', ), ('Ebates.com', ), ('Ebates', )], [('ExtraRebates', ), ('Mr. Rebates', ), ('RebatesMe', ), ('ExtraRebates', ), ('Mr. Rebates', ), ('RebatesMe', )] ...

Being new to Python, I am stuck finding a way to:

  1. Name each group in a separate column.
  2. Export the results (grouping, publisher) to a new table on SQL.

Any help is much appreciated !

1 Answers1

0

For this, you could just use the pandas library (Just do "pip install pandas" on your terminal / Command Prompt).

You can create a table with one column per each of those results and push this table to your sql database:

import pandas as pd
tableresult = pd.DataFrame(Groups).T
tableresult.to_sql('table', engine, index=False, if_exists='replace')

in your "to_sql" command:

  1. index=False: avoids having an index column pushed in your sql table
  2. if_exists='replace': one of the way to push a table in sql

For more information, you can check there.

I hope it helps,

BR.

RenauV
  • 363
  • 2
  • 11