
Setup

  • A PostgreSQL 10.6 database set up using Django REST Framework (DRF), which will eventually take over from the legacy system. It contains hundreds of millions of rows.
  • A single PostgreSQL 10.6 remote instance with a legacy schema.

Task

Import the data from the legacy database into the Django database in a reasonable time (minutes) so that this task can be done during CI runs.

Work so far

We first created a separate Python script that used the Django API to import data. The team knew this would be slow, but we wanted to test the API, and we got some useful feedback out of it. Inside our development environments (in a VM) this created maybe a couple dozen rows per second.
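
For reference, that first script was essentially of this shape (the endpoint, payload, and fetch_legacy_rows helper are illustrative, not the real code):

import requests

API_URL = 'http://django-host/api/small-table/'  # hypothetical endpoint

for row in fetch_legacy_rows():  # assumed helper yielding one dict per legacy row
    response = requests.post(API_URL, json=row, timeout=10)
    response.raise_for_status()  # one HTTP round trip (and one INSERT) per row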

We now want to import much more data. As a first step we moved the script into a Django management command so that we could instantiate the DRF serializers directly rather than calling the API. This was a little faster.
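
The command boils down to something like this (the app, serializer, and helper names are hypothetical):

from django.core.management.base import BaseCommand

from myapp.serializers import SmallTableSerializer  # hypothetical names


class Command(BaseCommand):
    help = 'Import legacy rows through the DRF serializers, skipping HTTP'

    def handle(self, *args, **options):
        for row in fetch_legacy_rows():  # assumed helper, as above
            serializer = SmallTableSerializer(data=row)
            serializer.is_valid(raise_exception=True)
            serializer.save()  # still validates and INSERTs one row at a time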

We then tried to instantiate the DRF serializers with many=True, but this was not noticeably faster.
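
That attempt was roughly the following; as far as we can tell, DRF's default ListSerializer.create() still calls the child serializer once per item, i.e. one INSERT per row, which would explain the lack of a speedup:

rows = list(fetch_legacy_rows())  # assumed helper, as above
serializer = SmallTableSerializer(data=rows, many=True)
serializer.is_valid(raise_exception=True)
serializer.save()  # default ListSerializer.create() loops over the items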

The current fastest implementation imports maybe 80-100 rows per second, so we decided to change tack. Since the importer is thoroughly tested, we tried importing the data with raw SQL through a foreign data wrapper, starting with one of the small tables. The resulting Python code looks like this (dedented for readability):

import psycopg2

connection = psycopg2.connect(dsn='[Django DB connection parameters]')
…
with connection.cursor() as cursor:
    # One-time postgres_fdw setup: extension, foreign server, user mapping,
    # and a local schema to hold the imported foreign table definitions.
    cursor.execute('CREATE EXTENSION postgres_fdw')
    cursor.execute(
        "CREATE SERVER legacy FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '{}', dbname '{}', updatable 'false')".format(…)
    )
    cursor.execute(
        "CREATE USER MAPPING FOR postgres SERVER legacy OPTIONS (user '{}', password '{}')".format(…)
    )
    cursor.execute('CREATE SCHEMA IF NOT EXISTS legacy_schema')
    cursor.execute(
        "IMPORT FOREIGN SCHEMA public LIMIT TO (small_table) FROM SERVER legacy INTO legacy_schema"
    )
…
with connection.cursor() as cursor:
    # Server-side copy: reading legacy_schema.small_table goes through the
    # FDW to the legacy database; writing goes straight into django_table.
    cursor.execute(
        """
        INSERT INTO django_table (…)
        SELECT …
          FROM legacy_schema.small_table
        RETURNING id, …
        """
    )

Running this SQL manually in the PostgreSQL shell works fine:

# INSERT INTO django_table …
 id | …
----+-…
  3 | …
(1 row)

INSERT 0 1

Problem

The Python code hangs at the last execute. At this point the state in the legacy database is as follows:

# SELECT query FROM pg_stat_activity WHERE state = 'active';
                                    query                                    
-----------------------------------------------------------------------------
 DECLARE c1 CURSOR FOR SELECT … FROM public.small_table …

At this point SELECT pg_cancel_backend([pid]) has no effect, while after a SELECT pg_terminate_backend([pid]) the Python code continues as normal. Also, running the SELECT from the cursor above by hand returns instantly, so it's not a simple size issue.
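
A useful extra check on the legacy side: since PostgreSQL 9.6, pg_stat_activity also exposes wait event columns, which show what the active backend is actually waiting on:

# SELECT pid, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state = 'active';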

I've also found that after a simple START TRANSACTION the DECLARE statement above runs just fine in the legacy database, so I suspect that the problem is on the Python side. strace -f of the command shows an endless loop (repeating every 20-30 seconds) like this:

<... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
futex(0x5627e2e8e430, FUTEX_WAKE, 1) = 1
<... futex resumed> )       = 0
futex(0xc420063d48, FUTEX_WAKE, 1 <unfinished ...>
pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL <unfinished ...>
<... futex resumed> )       = 1
<... futex resumed> )       = 0
pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
futex(0xc420062948, FUTEX_WAKE, 1) = 1
<... futex resumed> )       = 0
futex(0x5627e2e92d00, FUTEX_WAIT, 0, {tv_sec=29, tv_nsec=997730142} <unfinished ...>
futex(0xc420062948, FUTEX_WAIT, 0, NULL <unfinished ...>
<... pselect6 resumed> )    = 0 (Timeout)
futex(0xc420063d48, FUTEX_WAIT, 0, NULL <unfinished ...>
<... pselect6 resumed> )    = 0 (Timeout)
futex(0x5627e2e8e430, FUTEX_WAIT, 0, {tv_sec=60, tv_nsec=0}

Interestingly, I get a very different error if I create cursors with names. For example, if I create the first cursor as cursor(name='cd1'), I get:

cursor.execute('CREATE EXTENSION postgres_fdw')
psycopg2.ProgrammingError: syntax error at or near "CREATE"
LINE 1: DECLARE "cd1" CURSOR WITHOUT HOLD FOR CREATE EXTENSION postg...
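
In hindsight that error makes sense: a psycopg2 cursor created with name=... is a server-side cursor, and execute() on it wraps the statement in DECLARE "<name>" CURSOR WITHOUT HOLD FOR ..., which is only valid for row-returning statements such as SELECT, never for DDL:

# Server-side (named) cursor: psycopg2 emits DECLARE "cd1" CURSOR ... FOR <stmt>,
# so only row-returning statements are accepted here.
with connection.cursor(name='cd1') as cursor:
    cursor.execute('SELECT id FROM django_table')  # fine
    # cursor.execute('CREATE EXTENSION postgres_fdw')  # would fail as above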

Why would Python hang when creating/using a cursor for this query?

Using psycopg2-binary 2.7.6.1 from PyPI.


Update: it looks like the issue is that another connection had already been established directly to the legacy database when this code ran, because when I avoid that it runs correctly. I'll try working around that and see what happens.
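
The workaround amounts to something like this, where legacy_connection stands in for that pre-existing direct connection (the name is illustrative):

# legacy_connection: the separate psycopg2 connection opened directly to the
# legacy database earlier in the run (illustrative name).
legacy_connection.commit()  # end its open transaction
legacy_connection.close()   # and/or close it entirely before using the FDW

with connection.cursor() as cursor:
    cursor.execute('INSERT INTO django_table (…) SELECT … FROM legacy_schema.small_table')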

Comments
  • Is the Python code you have provided missing the query starting with `DECLARE c1 ...`? Or was some non-trivial piece of the query hidden behind the ellipsis? – Roman-Stop RU aggression in UA Nov 28 '18 at 22:29
  • @RomanKonoval Thanks, fixed. It does get to the `INSERT` statement, so all of the statements before that work fine. – l0b0 Nov 28 '18 at 22:39
  • Sorry, I still don't quite get it. The active query listed is `SELECT … FROM public.small_table …`, but there's no such query in the Python code. – Roman-Stop RU aggression in UA Nov 28 '18 at 22:57
  • I've imported the *foreign* "public" schema as "legacy_schema", so it will show up as "legacy_schema" in the Python code but "public" in the legacy database list of queries. – l0b0 Nov 28 '18 at 22:59
  • Again, the code does not return an error (which it would do if any of the names were wrong), and crucially it works if I copy this code (and insert the relevant values) in a Postgres shell. – l0b0 Nov 28 '18 at 23:02
  • @RomanKonoval It has the whole `INSERT` statement marked as active. I'll add a copy of that into the question when I can. – l0b0 Nov 28 '18 at 23:03
  • This looks like an issue in psycopg2. Can you [enable debug](http://initd.org/psycopg/docs/install.html#creating-a-debug-build) and get the log? – Roman-Stop RU aggression in UA Nov 28 '18 at 23:11
  • Any serials involved? [I have a (toy) postgres_fdw copying records from a twitter scraper, and I had to manually circumvent serials] – wildplasser Nov 28 '18 at 23:30
  • @wildplasser The Django table has an ID which is based on a serial. But no remote serials are involved. – l0b0 Nov 29 '18 at 00:18

0 Answers