
Because it's so easy to install on Debian stable, I decided to use PostgreSQL 9.6 to build a data warehouse for some data I need to process. The first step is to load the data into the database with minimal transformations, mostly correcting some known formatting errors and the way booleans are represented. I have checked that these corrections are cheap: writing n rows to disk takes time proportional to n.

However, bulk loading this data using PostgreSQL's COPY FROM (no matter how: \copy, psycopg2's copy_expert, or a server-side COPY FROM '/path/to/data.csv') takes a superlinear amount of time. The asymptotic time complexity seems to be somewhat better than O(exp(sqrt(n))). This is the complexity I observe even when I already (rough sketch after this list):

  • Set the isolation level to READ UNCOMMITTED, and
  • Set the primary key constraint to DEFERRED.
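
Roughly, the load boils down to the following; table and file names are placeholders, and all three COPY variants behave the same:

    BEGIN ISOLATION LEVEL READ UNCOMMITTED;  -- PostgreSQL accepts this but treats it as READ COMMITTED
    SET CONSTRAINTS ALL DEFERRED;            -- only affects constraints declared DEFERRABLE

    -- any one of the following:
    \copy my_table FROM '/path/to/data.csv' WITH (FORMAT csv)
    -- COPY my_table FROM '/path/to/data.csv' WITH (FORMAT csv);                 -- server-side
    -- psycopg2: cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv)", f)

    COMMIT;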

Here's what I'm seeing with one of the worst offenders, a 17M row table:

[Plot: the square root of the number of rows appears proportional to the logarithm of ingestion time when fsync is on]

Disabling fsync speeds up the process by a factor of 10, so I/O is obviously an enormous bottleneck. Other than that, however, the temporal behaviour doesn't change much:

[Plot: the square root of the number of rows also appears proportional to the logarithm of ingestion time when fsync is off]
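
(Toggling fsync for a test like this has to happen at the server level; one way to do it, assuming superuser access and a throwaway instance, is sketched below. synchronous_commit, by contrast, can be relaxed per session.)

    -- fsync can only be changed server-wide; a config reload is enough, no restart needed
    ALTER SYSTEM SET fsync = off;
    SELECT pg_reload_conf();

    -- synchronous_commit can be relaxed per session without risking corruption
    SET synchronous_commit = off;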

This problem disappears entirely when I use a surrogate key instead of the business key: when I use an auto-incrementing integer column as the primary key, ingestion takes Θ(n) time again, which is what I wanted. So not only do I have a perfectly valid workaround for my problem, but I also know that the complex primary key is the culprit (the business key is typically a tuple of short VARCHAR columns).
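
Schematically, the two layouts differ only in the key; the column names below are made up, but the shape is representative:

    -- business key as primary key: the version with superlinear load time
    CREATE TABLE ingest_bk (
        part_a  varchar(10),
        part_b  varchar(10),
        part_c  varchar(10),
        payload text,
        PRIMARY KEY (part_a, part_b, part_c)
    );

    -- surrogate key as primary key: loads in Θ(n)
    CREATE TABLE ingest_sk (
        id      bigserial PRIMARY KEY,
        part_a  varchar(10),
        part_b  varchar(10),
        part_c  varchar(10),
        payload text
        -- a UNIQUE constraint on (part_a, part_b, part_c) could be added after loading
    );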

However, I would like to understand why PostgreSQL takes so long to ingest the data when it is keyed by the business key, so that I understand my tools better. In particular, I have no idea how to debug this ingestion process, because EXPLAIN doesn't work on COPY. It could be that sorting the data into storage takes longer for compound primary keys, or that this is due to index maintenance, or that the primary key constraint is in fact still NOT DEFERRED. Even if the workaround weren't so effective, or were undesirable for other reasons: how can I discover what's actually going on here?
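
For example, I could split the input into fixed-size chunks, time each COPY, and watch the heap and index sizes grow (chunk file names below are placeholders), but I'd like to know whether there's a more direct way:

    \timing on

    -- repeat per chunk and compare elapsed times as the table grows
    \copy my_table FROM 'chunk_000.csv' WITH (FORMAT csv)
    SELECT pg_size_pretty(pg_table_size('my_table'))   AS heap,
           pg_size_pretty(pg_indexes_size('my_table')) AS indexes;

    \copy my_table FROM 'chunk_001.csv' WITH (FORMAT csv)
    SELECT pg_size_pretty(pg_table_size('my_table'))   AS heap,
           pg_size_pretty(pg_indexes_size('my_table')) AS indexes;

    -- if the per-chunk time keeps rising while the chunk size stays fixed,
    -- maintenance of the primary-key index is the prime suspect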

Rhymoid
  • 0) Tuning? 1) deferred is expensive 2) normally I copy into a temp table (with the same structure) and `insert into target_table(...) select distinct ... from temp_table where not exists(...)` 3) a surrogate as PK plus an (optional) UNIQUE constraint on the *business_key* works fine, (especially if the fat key is referenced by other tables; these could use the surrogate instead) – joop Jan 02 '18 at 17:54
  • Are you executing the import in a transaction block? – Curt Evans Jan 02 '18 at 18:31
  • Try setting `synchronous_commit = off` and `autovacuum = off` (for the duration of your loading at least). Did you try batching your files into smaller chunks for COPY? E.g. 500k per file (depends on available memory, since COPY is buffering your load). – bma Jan 02 '18 at 18:49
  • Drop all keys and indexes and triggers, then re-create them after import. – Neil McGuigan Jan 02 '18 at 21:21
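
A minimal sketch of the staging-table pattern joop describes above, with hypothetical table and column names:

    -- staging table with the same shape as the target, but no constraints or indexes
    CREATE TEMP TABLE staging (LIKE target_table INCLUDING DEFAULTS);

    \copy staging FROM '/path/to/data.csv' WITH (FORMAT csv)

    -- de-duplicate and skip rows whose business key already exists in the target
    INSERT INTO target_table (key_a, key_b, payload)
    SELECT DISTINCT key_a, key_b, payload
    FROM staging s
    WHERE NOT EXISTS (
        SELECT 1
        FROM target_table t
        WHERE t.key_a = s.key_a
          AND t.key_b = s.key_b
    );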

0 Answers