
I am trying to pull a large dataset from SQL Server and dedupe it using Python's dedupe library. I am using pyodbc as the DB connector, but I cannot figure out how to get the data into the format dedupe expects. The same approach works OK on MySQL, but without the dict-style row reads available there, the formatting of the data escapes me. Currently, I'm seeing the following error:

TypeError: row indices must be integers, not str

Here's the code that is trying to build the data:

cur = con.cursor()

print("\n\nExecuting TOMIS Select")
cur.execute(TOMISSelect)
print("\nSelect Complete")
colHeader = [column[0] for column in cur.description]
temp_d = {0:tuple(colHeader)}
temp_data = {(i+1): row for i, row in enumerate(cur)}
temp_d.update(temp_data)

if os.path.exists(training_file):
    print("\nReading labeled examples from ", training_file)
    with open(training_file) as tf:
        deduper.prepare_training(temp_d, tf)
else:
    print("\nManual Training")
    deduper.prepare_training(temp_d)

Here's the output and full trace:

Manual Training
Traceback (most recent call last):

  File "C:\Users\01-workspace\02-dedupe\TOMISDeDupe\TomisFullDeDupe.py", line 134, in <module>
    deduper.prepare_training(temp_d)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\api.py", line 806, in prepare_training
    self.sample(data, sample_size, blocked_proportion, original_length)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\api.py", line 838, in sample
    index_include=examples)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\labeler.py", line 403, in __init__
    self.candidates = super().sample(data, blocked_proportion, sample_size)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\labeler.py", line 43, in sample
    data)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 22, in blockedSample
    *args))

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 62, in dedupeSamplePredicates
    items)

  File "c:\users\01-workspace\02-dedupe\dedupe\dedupe\sampling.py", line 73, in dedupeSamplePredicate
    column = record[field]

TypeError: row indices must be integers, not str

I have tried several different methods for reading the data from SQL Server, to no avail. The MySQL queries return the data in the correct dictionary format, but I can't seem to get the same format out of SQL Server.

Dale K
WmSadler

1 Answer


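The traceback fails at record[field], where dedupe looks up each record by field name. pyodbc rows only accept integer indices, so each record needs to be a dict keyed by column name, and the header tuple should not be stored as record 0. I think you need to do something like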

colHeader = tuple(column[0] for column in cur.description)
temp_d = {i: dict(zip(colHeader, row)) for i, row in enumerate(cur)}
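
For anyone who wants to try the conversion without a SQL Server instance, here is a minimal sketch of the same idea. It assumes the standard-library sqlite3 module as a stand-in for pyodbc (both expose cur.description and iterable cursors), and the people table, its columns, and the rows_to_records helper are made up for illustration.

```python
import sqlite3


def rows_to_records(cursor):
    """Turn a DB-API cursor's result set into {record_id: {column: value}}."""
    # cursor.description holds one 7-item sequence per column; item 0 is the name.
    columns = tuple(col[0] for col in cursor.description)
    # Pair each row's values with the column names so records can be
    # looked up by field name, e.g. record["name"].
    return {i: dict(zip(columns, row)) for i, row in enumerate(cursor)}


# sqlite3 is used here only so the sketch runs anywhere; with pyodbc the
# cursor would come from pyodbc.connect(...).cursor() instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, city TEXT)")  # hypothetical table
con.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann Lee", "Nashville"), ("Anne Lee", "Nashville")],
)

cur = con.cursor()
cur.execute("SELECT name, city FROM people ORDER BY rowid")
records = rows_to_records(cur)

print(records[0]["name"])  # lookup by column name now works
```

The keys of each inner dict are the column names returned by the query, so the field names in the dedupe field definition would need to match them (or the columns can be aliased in the SELECT).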
fgregg