How to avoid duplicates on copying data from Python pandas dataframe to SQL database SQLLite?

Question

I want to insert a pandas dataframe into a sqltable that I have pre-created using SQL alchemy.

However, on doing so I get duplicates, as SQLite has an index column and when I copy from the dataframe, it is taking a different index and even if the data is the same, it may append it.

Details about my status:

My pandas dataframe does not have an index on insert. (index=False) Also even though index is primary key, it is the remaining data that is important. No data row shall be duplicate, e.g. have the same combination of contents for each of the columns.

My SQL alchemy create table: (From database.py)

class SQLTable(Base):
    __tablename__ = 'sqltable'
    __table_args__ = {'sqlite_autoincrement': True}
    index = Column(Integer, primary_key= True, nullable=False)
    A = Column(Integer, nullable=True)
    B  = Column(String, nullable=True)
    C  = Column(BigInteger, nullable=True)
    D  = Column(Integer, nullable=True)
    E  = Column(String, nullable=True)
    F  = Column(Float, nullable=True)

However, I only wish to insert into the sqltable if the line other than the index is not already existant. I found python-pandas-tosql-only-insert-new-rows however I dont know how to apply this to my data.

 df.to_sql(
     name="sqltable", 
     con=engine,
     if_exists='replace',
     index=False,
     dtype={
         'index': sqlalchemy.types.INTEGER(),
         'A': sqlalchemy.types.INTEGER(),
         'B': sqlalchemy.types.VARCHAR(length=255),
         'C': sqlalchemy.types.BIGINT(),
         'D': sqlalchemy.types.VARCHAR(length=255),
         'E': sqlalchemy.types.VARCHAR(length=255),
         'F': sqlalchemy.types.Float(precision=3, asdecimal=True)
     }
 )

How to go about the index? What is the best way to do this?

Online tutorial for inserting new rows into SQL table

Have you tried without the line __table_args__ = {'sqlite_autoincrement': True} ? It will **always** autoincrement your index See: https://stackoverflow.com/questions/4567574/pylons-sqlite-and-autoincrementing-fields . I don't think this is your desired behavior. — above_c_level, Mar 19 '20 at 16:35
how big is your data smiley? personally I would use a store proc and use a hash function to only bring in the delta but I don't think you can use stored procedures in SQL? will add a pandas solution below — Umar.H, Mar 20 '20 at 23:52

Oluwafemi Sule · Answer 1 · 2020-03-16T09:09:03.403

2

Write the panda rows to a different table as placeholder.

import pandas as pd

col_options = dict(
    dtype={
        'index': sqlalchemy.types.INTEGER(),
        'A': sqlalchemy.types.INTEGER(),
        'B': sqlalchemy.types.VARCHAR(length=255),
        'C': sqlalchemy.types.BIGINT(),
        'D': sqlalchemy.types.VARCHAR(length=255),
        'E': sqlalchemy.types.VARCHAR(length=255),
        'F': sqlalchemy.types.Float(precision=3, asdecimal=True)
    }
)
df.to_sql(name="sqltable_temp", con=engine, if_exists='replace', index=False, **col_options)

Retrieve only new records when comparing values in placeholder table with those in table you intend to write new records in.

query = """
    SELECT A, B, C, D, E, F FROM sqltable_temp 
    EXCEPT 
    SELECT A, B, C, D, E, F FROM sqltable;
"""

new_entries = pd.read_sql(query, con=engine, **col_options)

Append new records in the table.

new_entries.to_sql(
    name="sqltable", con=engine, if_exists='append', **col_options)

Finally, delete placeholder table

engine.execute("DROP TABLE sqltable_temp;")

edited Mar 16 '20 at 09:09

answered Mar 15 '20 at 17:02

Oluwafemi Sule

36,144
1
56
81

Unexpected keyword argument 'dtype' in function callpylint(unexpected-keyword-arg) when I hover over new_entries = pd.read_sql(query, **db_options) over pd, there seems to be an error. also the sqltable_temp is not erased, even when I add DROP TABLE sqltable_temp; What can I do? I use Visual Studio Code editor. It seems from new_entries on there is something not in, also the code is not executed from new entries on line. can you help me please? – Smiley Mar 15 '20 at 22:34
Thanks for pointing out these errors. `dtypes` is not a valid option to `DataFrame.read_sql`. Take out the `con` option dictionary and pass it separately – Oluwafemi Sule Mar 16 '20 at 09:12
can it also be done with an index, because its still not processing and skiooing the lines?? – Smiley Mar 18 '20 at 19:22
The code stops at pd.read_sql and it is not executed. Can you please help me? – Smiley Mar 19 '20 at 14:29
Not without better information. How are you sure it's skipping the lines? Was an exception raised? Perhaps, there were no new entries when `sqltable_temp` is compared with `sqltable`. You can print `new_entries` to see if it's an empty dataframe. – Oluwafemi Sule Mar 19 '20 at 17:01

How to avoid duplicates on copying data from Python pandas dataframe to SQL database SQLLite?

1 Answers1