We are trying to load a large table (358M records, 160GB uncompressed) into Aurora. We've tried several combinations, but none of them does what we want.
If we create the primary key and a partitioned schema and then try to bulk insert, it never completes - the insert times increase exponentially. This is true even if we pre-sort the records before the bulk insert.
If we bulk insert without the primary key and indexes, it completes, but then we are unable to add the primary key afterwards.
The Table
Customer table, with customer_id as the primary key (big integer). There are four other foreign key columns - let's call them fk1, fk2, fk3, fk4 - all big integers. We only want these columns indexed, not declared as actual foreign key constraints.
In addition, we have 40 other columns - mostly varchars and some timestamps
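For reference, a stripped-down sketch of the target schema (all column names other than customer_id and fk1-fk4 are placeholders for illustration):

CREATE TABLE customer (
  customer_id BIGINT NOT NULL,
  fk1 BIGINT NOT NULL,
  fk2 BIGINT NOT NULL,
  fk3 BIGINT NOT NULL,
  fk4 BIGINT NOT NULL,
  -- plus ~40 more columns, mostly VARCHARs and some TIMESTAMPs, e.g.
  some_varchar_col VARCHAR(255),
  some_timestamp_col TIMESTAMP NULL,
  PRIMARY KEY (customer_id),
  KEY idx_fk1 (fk1),
  KEY idx_fk2 (fk2),
  KEY idx_fk3 (fk3),
  KEY idx_fk4 (fk4)
) ENGINE=InnoDB;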
Approach #1: Without primary keys / indexes
Server: db.r3.large. Client: m4.large (Amazon Linux instance).
We created the table schema WITHOUT the primary key or the indexes. We ran mysqlimport on the complete 160GB file. It took about 3 hours to load the database.
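mysqlimport is essentially a command-line wrapper around LOAD DATA INFILE, so a single run boiled down to something like this (the file path and delimiters are placeholders; mysqlimport derives the table name from the file name):

LOAD DATA LOCAL INFILE '/data/customer.txt'
INTO TABLE customer
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';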
Thereafter, we ran an alter table to create the primary key. It failed with the error message "ERROR 1114 (HY000): The table is full".
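The statement was essentially the following (a sketch; adding a primary key forces a full rebuild of the InnoDB table, which needs a lot of temporary space):

ALTER TABLE customer ADD PRIMARY KEY (customer_id);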
We figured this was because the local (ephemeral) storage was full (see https://forums.aws.amazon.com/message.jspa?messageID=691512#691512), so we upgraded the database server to a db.r3.xlarge. We hit the same error again, and gave up on creating the primary key after the records were inserted.
Approach #2: Partitioning and Pre-sorting data
Next, we decided to partition the customer table, hoping it would help. We want to partition on FK1, so we ended up redefining our primary key as (FK1, customer_id) to accommodate MySQL's partitioning rules.
We then sorted the input files by (FK1, customer_id) and split them into chunks of 10M records per file.
We created the schema with primary key (FK1, customer_id), partitioned on FK1.
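Roughly as follows - the partition type and count shown here are illustrative; the composite key is needed because MySQL requires every unique key (the primary key included) to contain all columns used in the partitioning expression:

CREATE TABLE customer (
  customer_id BIGINT NOT NULL,
  fk1 BIGINT NOT NULL,
  -- fk2..fk4 and the ~40 other columns as in the earlier sketch
  PRIMARY KEY (fk1, customer_id)
) ENGINE=InnoDB
PARTITION BY HASH (fk1)
PARTITIONS 32;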
Finally, we imported the 36 files in a single-threaded loop (the per-file load is sketched below the SET statements). We set the following connection properties:
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SESSION tx_isolation='READ-UNCOMMITTED';
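Each iteration of the loop then loaded the next chunk, roughly like this (the file name is a placeholder):

LOAD DATA LOCAL INFILE '/data/chunks/customer_013.txt'
INTO TABLE customer
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';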
Initially, each file completed in 10 minutes (i.e. a rate of 1M inserts per minute). Somewhere after about 100M records, though, the processing time per file started climbing steeply: the 13th file took 16 minutes, the 20th file 175 minutes, and the 21st around 280 minutes. We gave up at that point.
During the import the server metrics looked fairly normal - CPU utilization was low and network IO dropped. The interesting metric was that the number of reads kept increasing monotonically while the number of writes kept decreasing monotonically; both graphs looked like step functions moving in opposite directions.
Current State
We gave up trying to create the primary key and the partitioned table. Instead, we just created a non-unique index on customer_id.
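i.e. something along the lines of (the index name is a placeholder):

CREATE INDEX idx_customer_id ON customer (customer_id);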
Any clue as to what we can do to load the records with the primary key in place?