This should be easy, so maybe I've fried my brain, but I can't get this to work. I have a database table with over a million rows. I want to add about 15k more rows, but some of that data might already exist in the table.
My basic process is this:
async def process_data(object_list):
    # Load the primary keys that already exist in the db:
    pks = await copy_pks()
    # Remove duplicates:
    filtered_data = [obj for obj in object_list if obj.id not in pks]
    # Add only the values that aren't already there:
    await copy_in(filtered_data)
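For illustration, each obj is essentially a plain record keyed by an integer primary key. The real class has more fields, but something like this (a simplified stand-in, not my actual class):

    from dataclasses import dataclass

    # Simplified stand-in for the real objects; only `id` matters for the deduplication.
    @dataclass
    class DataObject:
        id: int
        name: str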
My copy_pks method:
@with_connection
async def copy_pks(self, connection):
    pks = set()
    # Stream the existing primary keys back and collect them into a set:
    async with connection.cursor().copy(sql) as cursor:
        async for row in cursor:
            res = await cursor.read_row()
            pks.add(int(res[0]))
    return pks
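The sql here is just a COPY of the primary-key column to stdout; the table and column names below are placeholders for the real ones:

    # Placeholder names -- the real statement targets my actual table's pk column.
    sql = "COPY my_table (id) TO STDOUT"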
My copy_in method:
@with_connection
async def copy_in(self, connection, data):
    # Write each filtered object as one row of the COPY:
    async with connection.cursor().copy(sql) as cursor:
        for obj in data:
            await cursor.write_row(obj)
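Here sql is the matching COPY ... FROM STDIN statement, and with_connection is just a small decorator that checks a connection out of an async pool and passes it to the method as its first argument. A simplified sketch of both, with placeholder names and a psycopg_pool pool standing in for my real connection handling:

    import functools
    from psycopg_pool import AsyncConnectionPool

    # Placeholder DSN; the real pool is opened once at startup with `await pool.open()`.
    pool = AsyncConnectionPool("postgresql:///mydb", open=False)

    # Roughly what the copy_in statement looks like (placeholder table/columns):
    sql = "COPY my_table (id, name) FROM STDIN"

    def with_connection(method):
        # Hand a pooled connection to the wrapped method as its first argument.
        @functools.wraps(method)
        async def wrapper(self, *args, **kwargs):
            async with pool.connection() as connection:
                return await method(self, connection, *args, **kwargs)
        return wrapper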
The problem is that I keep hitting "duplicate key value violates unique constraint" despite my attempt to filter. I've checked the lengths of object_list and filtered_data, and the latter is about half the size... so clearly something is getting filtered.
I then checked specifically for the offending object:
for obj in object_list:
    if obj.id == 14856714938:
        print("Found")
        if obj.id in pks:
            print("In pks")
        if obj.id in filtered_data:
            print("In filtered")
I receive "Found", but I don't receive "In pks" and I don't receive "In filtered."
Unless I'm blind to some syntax or logic error here, my best shot in the dark is that it's somehow a RAM issue, given the size of the data. Could that be it, or am I missing something obvious?