Setup
- A PostgreSQL 10.6 database behind a Django application using Django REST Framework (DRF), which will eventually take over from the legacy system. Contains hundreds of millions of rows.
- A single PostgreSQL 10.6 remote instance with a legacy schema.
Task
Import the data from the legacy database into the Django database in a reasonable time (minutes) so that this task can be done during CI runs.
Work so far
We first created a separate Python script that used the Django API to import data. The team knew this was going to be slow, but we wanted to exercise the API, and it produced some useful feedback. In our development environments (a VM) it created maybe a couple dozen rows per second.
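Roughly, that first script boiled down to one API call per row, along these lines (a sketch only; the endpoint URL, payload, and auth are placeholders, not our real API):

import requests

# Sketch: one POST per legacy row against a DRF endpoint.
session = requests.Session()
session.headers['Authorization'] = 'Token {}'.format('[token]')
for row in fetch_legacy_rows():  # hypothetical helper reading the legacy DB
    response = session.post('http://localhost:8000/api/items/', json=row)
    response.raise_for_status()

One HTTP round trip (and one transaction) per row is consistent with that kind of throughput.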
We now want to import much more data. We started by moving the script into a Django management command so that we could instantiate the DRF serializers directly rather than calling the API. This was a little faster.
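The management command version skips HTTP and feeds each row to the serializer directly, roughly like this (the serializer and helper names are hypothetical):

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Import legacy rows through the DRF serializers'

    def handle(self, *args, **options):
        for row in fetch_legacy_rows():  # hypothetical helper
            serializer = LegacyItemSerializer(data=row)  # hypothetical serializer
            serializer.is_valid(raise_exception=True)
            serializer.save()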
We then tried instantiating the DRF serializers with many=True, but this was not noticeably faster.
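For reference, the many=True variant is just (names as above):

serializer = LegacyItemSerializer(data=rows, many=True)  # rows: list of dicts
serializer.is_valid(raise_exception=True)
serializer.save()

The lack of a speedup makes sense in hindsight: DRF's default ListSerializer.create() still calls the child serializer's create() once per item, so it issues one INSERT per row rather than a bulk insert.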
The current fastest implementation imports maybe 80-100 rows per second, so we decided to change tack. The importer is thoroughly tested, so we felt safe trying a foreign data wrapper to import the data with raw SQL instead, starting with one of the small tables. The resulting Python code looks like this:
import psycopg2

connection = psycopg2.connect(dsn='[Django DB connection parameters]')
…
with connection.cursor() as cursor:
    # One-time setup: wire the legacy instance into the Django database
    # as a read-only foreign server.
    cursor.execute('CREATE EXTENSION postgres_fdw')
    cursor.execute(
        "CREATE SERVER legacy FOREIGN DATA WRAPPER postgres_fdw "
        "OPTIONS (host '{}', dbname '{}', updatable 'false')".format(…)
    )
    cursor.execute(
        "CREATE USER MAPPING FOR postgres SERVER legacy "
        "OPTIONS (user '{}', password '{}')".format(…)
    )
    # Expose the legacy table as a foreign table in its own schema.
    cursor.execute('CREATE SCHEMA IF NOT EXISTS legacy_schema')
    cursor.execute(
        'IMPORT FOREIGN SCHEMA public LIMIT TO (small_table) '
        'FROM SERVER legacy INTO legacy_schema'
    )
…
with connection.cursor() as cursor:
    # Copy the rows across in a single INSERT … SELECT over the FDW.
    cursor.execute(
        """
        INSERT INTO django_table (…)
        SELECT …
        FROM legacy_schema.small_table
        RETURNING id, …
        """
    )
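Since the statement ends with RETURNING, the new ids come back on the same cursor once it completes; still inside the second with block:

    id_rows = cursor.fetchall()  # one (id, …) tuple per imported row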
Running this SQL manually in the PostgreSQL shell works fine:
# INSERT INTO django_table …
id | …
----+-…
3 | …
(1 row)
INSERT 0 1
Problem
The Python code hangs at the last execute(). At this point the state in the legacy database is as follows:
# SELECT query FROM pg_stat_activity WHERE state = 'active';
query
-----------------------------------------------------------------------------
DECLARE c1 CURSOR FOR SELECT … FROM public.small_table …
At this point SELECT pg_cancel_backend([pid]) doesn't do anything, but after a SELECT pg_terminate_backend([pid]) the Python code continues as normal. Also, the SELECT query in the cursor above returns instantly when run by hand, so it's not a simple size issue.
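To see what the legacy backend is actually waiting on, the wait_event columns in pg_stat_activity (available since PostgreSQL 9.6) are more informative than state alone; a diagnostic sketch over a second connection to the legacy instance:

import psycopg2

# Diagnostic sketch: show what each non-idle legacy backend is waiting on.
# The DSN is a placeholder for the legacy instance's connection parameters.
conn = psycopg2.connect(dsn='[legacy DB connection parameters]')
with conn.cursor() as cur:
    cur.execute(
        "SELECT pid, state, wait_event_type, wait_event, query "
        "FROM pg_stat_activity WHERE state <> 'idle'"
    )
    for row in cur.fetchall():
        print(row)
conn.close()

A wait_event of ClientRead on the DECLARE backend would mean it is itself waiting on the postgres_fdw client, i.e. on the Django-side session.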
I've also found that after a simple START TRANSACTION, the DECLARE statement above runs just fine in the legacy database, so I suspect the problem is on the Python side. strace -f of the command shows an endless loop (repeating every 20-30 seconds) like this:
<... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
futex(0x5627e2e8e430, FUTEX_WAKE, 1) = 1
<... futex resumed> ) = 0
futex(0xc420063d48, FUTEX_WAKE, 1 <unfinished ...>
pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=20000}, NULL <unfinished ...>
<... futex resumed> ) = 1
<... futex resumed> ) = 0
pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=3000}, NULL <unfinished ...>
futex(0xc420062948, FUTEX_WAKE, 1) = 1
<... futex resumed> ) = 0
futex(0x5627e2e92d00, FUTEX_WAIT, 0, {tv_sec=29, tv_nsec=997730142} <unfinished ...>
futex(0xc420062948, FUTEX_WAIT, 0, NULL <unfinished ...>
<... pselect6 resumed> ) = 0 (Timeout)
futex(0xc420063d48, FUTEX_WAIT, 0, NULL <unfinished ...>
<... pselect6 resumed> ) = 0 (Timeout)
futex(0x5627e2e8e430, FUTEX_WAIT, 0, {tv_sec=60, tv_nsec=0}
Interestingly, I get a very different error if I create cursors with names. For example, if I create the first cursor as cursor(name='c1'), I get:
cursor.execute('CREATE EXTENSION postgres_fdw')
psycopg2.ProgrammingError: syntax error at or near "CREATE"
LINE 1: DECLARE "c1" CURSOR WITHOUT HOLD FOR CREATE EXTENSION postgr...
(That part at least makes sense: a named psycopg2 cursor is a server-side cursor, so every execute() is wrapped in DECLARE … CURSOR FOR …, which is only valid for SELECT-like statements, not DDL.)
Why would Python hang when creating/using a cursor for this query?
Using psycopg2-binary 2.7.6.1 from PyPI.
Update
It looks like the issue is that another connection to the legacy database is already open when this code runs, because when I avoid that, it runs correctly. I'll try working around that and see what happens.
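The workaround I'll try first amounts to not holding that direct connection open across the FDW statements (legacy_conn is a hypothetical name for it):

legacy_conn.close()  # hypothetical: the pre-existing direct legacy connection
# …then run the FDW setup and the INSERT … SELECT on the Django connection.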