2

I have a few different functions that scrape different tables with pandas, saves each to a dataframe, and saves them to a PostgreSQL database. I am able to successfully scrape and save each table as a dataframe, but I am having a bit of an issue when saving it to SQL. I am trying to do this with something like the below:

from sqlalchemy import create_engine

# Opening sql connection
engine = create_engine('postgresql://postgres:pw@localhost/name')
con = engine.connect()

def df1():
    df = scraped_data
    df.to_sql(table_name, con, if_exists='replace')
df1()

def df2():
    df = scraped_data
    df.to_sql(table_name, con, if_exists='replace')
df2()

# Closing connection
con.close()

I am able to successfully save df1 to SQL, but I get an error when running df2. The only real difference between the two functions is that they are scraping data from different sources. Everything else is essentially the same.

I have a couple other functions for other dataframes, but only the first one ever works no matter which order I call the functions.

I keep getting the same error for all of the other functions I call:

psycopg2.ProgrammingError: incomplete placeholder: '%(' without ')'

They also linked a page for background on the error: http://sqlalche.me/e/f405), although I still don't quite know what to make of it.

I just find it strange how it works for one function but not the others when the only thing that changes is the url that I am scraping from.

EDIT

I am scraping data from the NFL's website.

df1 iterates through years in a table from http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=TM&offensiveStatisticCategory=GAME_STATS&defensiveStatisticCategory=null&season=2019&seasonType=REG&tabSeq=2&qualified=false&Submit=Go.

df2 does a very similar thing but pulls data from http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=TM&offensiveStatisticCategory=TEAM_PASSING&defensiveStatisticCategory=null&season=2019&seasonType=REG&tabSeq=2&qualified=false&Submit=Go.

It looks like that the main difference is that df1 uses Pct to represent percentage in a column header whereas df2 uses %

messy748
  • 327
  • 4
  • 16
  • i think it has to do with the con.close option. try and run it in such a way that both functions execute before getting to the con.close() line. or better, try it without the con.close and see if it writes successfully. if it does, then u can make adjustments for the connection clause – sammywemmy Feb 02 '20 at 21:44
  • The second argument to `to_sql` needs to be `engine`, not `con`. – Gord Thompson Feb 02 '20 at 22:26
  • I tried both of the suggestions above but I am still getting the same error that I described in the post. – messy748 Feb 03 '20 at 03:19
  • If the only difference is the data, then *please include a minimal sample of said data* in the question, or in other words produce a [mcve]. It usually also helps to include the *full traceback*, not just the error message. It would have included essential information in this case as well. – Ilja Everilä Feb 03 '20 at 07:43

1 Answers1

3

TL;DR: You have a potential SQL injection hole.

The problem is that one of your column names contains %. Here is a minimal reproducible example:

In [8]: df = pd.DataFrame({"%A": ['x', 'y', 'z']})

In [9]: df.to_sql('foo', engine, if_exists='replace')

which produces the following log and traceback:

...
INFO:sqlalchemy.engine.base.Engine:
DROP TABLE foo
INFO:sqlalchemy.engine.base.Engine:{}
INFO:sqlalchemy.engine.base.Engine:COMMIT
INFO:sqlalchemy.engine.base.Engine:
CREATE TABLE foo (
        index BIGINT, 
        "%%A" TEXT
)


INFO:sqlalchemy.engine.base.Engine:{}
INFO:sqlalchemy.engine.base.Engine:COMMIT
INFO:sqlalchemy.engine.base.Engine:BEGIN (implicit)
INFO:sqlalchemy.engine.base.Engine:INSERT INTO foo (index, "%%A") VALUES (%(index)s, %(%A)s)
INFO:sqlalchemy.engine.base.Engine:({'index': 0, '%A': 'x'}, {'index': 1, '%A': 'y'}, {'index': 2, '%A': 'z'})
INFO:sqlalchemy.engine.base.Engine:ROLLBACK
---------------------------------------------------------------------------
ProgrammingError                          Traceback (most recent call last)
~/Work/sqlalchemy/lib/sqlalchemy/engine/base.py in _execute_context(self, dialect, constructor, statement, parameters, *args)
   1239                     self.dialect.do_executemany(
-> 1240                         cursor, statement, parameters, context
   1241                     )

~/Work/sqlalchemy/lib/sqlalchemy/dialects/postgresql/psycopg2.py in do_executemany(self, cursor, statement, parameters, context)
    854         if self.executemany_mode is EXECUTEMANY_DEFAULT:
--> 855             cursor.executemany(statement, parameters)
    856             return

ProgrammingError: incomplete placeholder: '%(' without ')'

The above exception was the direct cause of the following exception:

ProgrammingError                          Traceback (most recent call last)
<ipython-input-9-88cf8a93ad8c> in <module>()
----> 1 df.to_sql('foo', engine, if_exists='replace')

...

ProgrammingError: (psycopg2.ProgrammingError) incomplete placeholder: '%(' without ')'
[SQL: INSERT INTO foo (index, "%%A") VALUES (%(index)s, %(%A)s)]
[parameters: ({'index': 0, '%A': 'x'}, {'index': 1, '%A': 'y'}, {'index': 2, '%A': 'z'})]
(Background on this error at: http://sqlalche.me/e/f405)

As can be seen SQLAlchemy/Pandas uses the column name as the placeholder key: %(%A)s. This means that you may be open to SQL injection, especially since you are handling scraped data:

In [3]: df = pd.DataFrame({"A": [1, 2, 3], """A)s);
   ...: DO $$
   ...: BEGIN
   ...: RAISE 'HELLO, BOBBY!';
   ...: END;$$ --""": ['x', 'y', 'z']})

In [4]: df.to_sql('foo', engine, if_exists='replace')

The result:

...
INFO sqlalchemy.engine.base.Engine INSERT INTO foo (index, "A", "A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --") VALUES (%(index)s, %(A)s, %(A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --)s)
INFO sqlalchemy.engine.base.Engine ({'index': 0, 'A': 1, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'x'}, {'index': 1, 'A': 2, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'y'}, {'index': 2, 'A': 3, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'z'})
INFO sqlalchemy.engine.base.Engine ROLLBACK
---------------------------------------------------------------------------
RaiseException                            Traceback (most recent call last)
...

InternalError: (psycopg2.errors.RaiseException) HELLO, BOBBY!
CONTEXT:  PL/pgSQL function inline_code_block line 3 at RAISE

[SQL: INSERT INTO foo (index, "A", "A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --") VALUES (%(index)s, %(A)s, %(A)s);
DO $$
BEGIN
RAISE 'HELLO, BOBBY!';
END;$$ --)s)]
[parameters: ({'index': 0, 'A': 1, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'x'}, {'index': 1, 'A': 2, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'y'}, {'index': 2, 'A': 3, "A)s);\nDO $$\nBEGIN\nRAISE 'HELLO, BOBBY!';\nEND;$$ --": 'z'})]
(Background on this error at: http://sqlalche.me/e/2j85)

If you are using a database user with sufficient privileges, this allows for example executing arbitrary commands on your machine:

In [11]: df = pd.DataFrame({"A": [1, 2, 3], """A)s);
    ...: CREATE TEMPORARY TABLE IF NOT EXISTS evil (state text);
    ...: DO $$
    ...: BEGIN
    ...: IF NOT EXISTS (SELECT * FROM evil) THEN
    ...: COPY evil (state) FROM PROGRAM 'send_ssh_keys | echo done';
    ...: END IF;
    ...: END;$$ --""": ['x', 'y', 'z']})

This might seem like an oversight on SQLAlchemy's (and/or Pandas') part, but usually you are not meant to allow users, or outside data, to define your schema, and so table and column names are "trusted". In this light the only proper solution is to whitelist columns, i.e. check against a known set that your dataframe has only allowable columns.

Ilja Everilä
  • 50,538
  • 7
  • 126
  • 127
  • Thank you for your detailed explanation. I made an edit in my original post trying to explain the difference between `df1` and `df2`. It seems that a workaround to this issue would just be replacing the `%` symbol in `df2` to `Pct`. – messy748 Feb 04 '20 at 00:58
  • 1
    It is a workaround for sure and you may want to remove or replace any `()` as well. I'm not a 100% sure if that'll close the hole, but at least it makes it quite a bit harder. Whitelisting and lookup replacements are safe. – Ilja Everilä Feb 04 '20 at 06:02