What is the fastest way to load data into a new CitusDB instance?

Question

I'm following the instructions on Scaling Out Data Ingestion, with this command:

find . -type f | xargs -n 1 -P 320 sh -c 'echo $0 `copy_to_distributed_table -C $0 table_name`'

My cluster has a master and eight workers, each with two SSDs. The table is spread across 320 shards.

Data loading is taking a very long time. The average insertion rate seems to be about 750k per minute. Is that normal or is there a way to speed it up?

The only thing I can think of is that I have replication enabled. Should that be turned off for loading and then reset?

score 1 · Accepted Answer · edited May 25 '16 at 21:37

I assume that you want to use hash partitioning. If that is the case, we're deprecating copy_to_distributed_table in favor of distributed COPY. COPY provides a native PostgreSQL experience, resolves several known issues, and improves ingest performance by more than an order of magnitude. This is now available as of Citus 5.1, which was released this month and is available in the official PostgreSQL Linux package repositories (PGDG).

What is the fastest way to load data into a new CitusDB instance?

1 Answers1