I am trying to pull a large dataset from SQL Server and deduplicate it using Python's dedupe library. I am using pyodbc as the database connector, but I cannot figure out how to get the rows into the format dedupe expects. The same approach works fine with MySQL, where the rows are read as dictionaries; without dict row reads on the SQL Server side, I can't work out how to shape the data. Currently, I'm seeing the following error:
TypeError: row indices must be integers, not str
Here's the code that is trying to build the data:
cur = con.cursor()
print("\n\nExecuting TOMIS Select")
cur.execute(TOMISSelect)
print("\nSelect Complete")
colHeader = [column[0] for column in cur.description]
temp_d = {0:tuple(colHeader)}
temp_data = {(i+1): row for i, row in enumerate(cur)}
temp_d.update(temp_data)
if os.path.exists(training_file):
    print("\nReading labeled examples from ", training_file)
    with open(training_file) as tf:
        deduper.prepare_training(temp_d, tf)
else:
    print("\nManual Training")
    deduper.prepare_training(temp_d)
Here's the output and full trace:
Manual Training
Traceback (most recent call last):
  File "C:\Users\01-workspace\02-dedupe\TOMISDeDupe\TomisFullDeDupe.py", line 134, in <module>
    deduper.prepare_training(temp_d)
  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\api.py", line 806, in prepare_training
    self.sample(data, sample_size, blocked_proportion, original_length)
  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\api.py", line 838, in sample
    index_include=examples)
  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\labeler.py", line 403, in __init__
    self.candidates = super().sample(data, blocked_proportion, sample_size)
  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\labeler.py", line 43, in sample
    data)
  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 22, in blockedSample
    *args))
  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 62, in dedupeSamplePredicates
    items)
  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 73, in dedupeSamplePredicate
    column = record[field]
TypeError: row indices must be integers, not str
I have tried several different ways of reading the data from SQL Server, to no avail. The MySQL queries return the data in the correct dictionary format, but I can't seem to get the same format out of SQL Server.
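
For reference, the traceback ends at `record[field]`, which suggests dedupe looks each field up by name, so every record probably needs to be a dict keyed by column name rather than a positional row. A minimal sketch of that target shape follows. It assumes the rows iterate like plain tuples and that the column names come from item 0 of each `cur.description` entry, as in the code above. The helper name `rows_to_records` and the sample data are hypothetical and only there for illustration.

```python
def rows_to_records(description, rows):
    """Build {record_id: {column_name: value}} from tuple-like rows.

    `description` is expected to look like a DB-API cursor.description:
    a sequence of tuples whose first item is the column name.
    """
    columns = [col[0] for col in description]
    # Pair each value with its column name so fields can be looked up by name.
    return {i: dict(zip(columns, row)) for i, row in enumerate(rows)}


# Stand-ins for cur.description and the rows a cursor would yield (made-up data).
sample_description = [
    ("name", None, None, None, None, None, None),
    ("city", None, None, None, None, None, None),
]
sample_rows = [("Ann Smith", "Nashville"), ("Anne Smith", "Nashville")]

records = rows_to_records(sample_description, sample_rows)
print(records[0]["name"])
```

With a live connection this would presumably be called as `rows_to_records(cur.description, cur)` in place of building `temp_d` by hand. Note that the sketch leaves out the header tuple stored under key `0`, on the assumption that dedupe would treat it as just another record.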