We are trying to load a large table (358M records, 160GB uncompressed) into Aurora. We've tried several combinations, but none of them does what we want.
If we create the primary key and a partitioned schema and then try to bulk insert, it never completes - the insert times increase exponentially. This is true even if we pre-sort the records before the bulk insert.
If we bulk insert without the primary key and indexes, it completes, but then we are unable to add the primary key afterwards.
The Table
Customer table, with customer_id as the primary key (big integer). There are four other foreign key columns - let's call them fk1, fk2, fk3, fk4 - all big integers. We only want these columns indexed, not declared as actual foreign key constraints.
In addition, we have 40 other columns - mostly varchars and some timestamps
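For reference, a stripped-down sketch of the target schema (all column names other than customer_id and fk1-fk4 are placeholders for illustration):

CREATE TABLE customer (
  customer_id BIGINT NOT NULL,
  fk1 BIGINT NOT NULL,
  fk2 BIGINT NOT NULL,
  fk3 BIGINT NOT NULL,
  fk4 BIGINT NOT NULL,
  -- plus ~40 more columns, mostly VARCHARs and some TIMESTAMPs, e.g.
  some_varchar_col VARCHAR(255),
  some_timestamp_col TIMESTAMP NULL,
  PRIMARY KEY (customer_id),
  KEY idx_fk1 (fk1),
  KEY idx_fk2 (fk2),
  KEY idx_fk3 (fk3),
  KEY idx_fk4 (fk4)
) ENGINE=InnoDB;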
Approach #1: Without primary keys / indexes
Server: db.r3.large. Client: m4.large (Amazon Linux instance).
We created the table schema WITHOUT the primary key or the indexes. We ran mysqlimport on the complete 160GB file. It took about 3 hours to load the database.
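mysqlimport is essentially a command-line wrapper around LOAD DATA INFILE, so a single run boiled down to something like this (the file path and delimiters are placeholders; mysqlimport derives the table name from the file name):

LOAD DATA LOCAL INFILE '/data/customer.txt'
INTO TABLE customer
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';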
Thereafter, we ran an alter table to create the primary key. It failed with the error message "ERROR 1114 (HY000): The table is full".
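The statement was essentially the following (a sketch; adding a primary key forces a full rebuild of the InnoDB table, which needs a lot of temporary space):

ALTER TABLE customer ADD PRIMARY KEY (customer_id);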
We figured this was because the local (ephemeral) storage was full (see https://forums.aws.amazon.com/message.jspa?messageID=691512#691512), so we upgraded the database server to a db.r3.xlarge. We hit the same error again, and gave up on creating the primary key after the records were inserted.
Approach #2: Partitioning and Pre-sorting data
Next, we decided to partition the customer table, hoping it would help. We want to partition on FK1, so we ended up redefining our primary key as (FK1, customer_id) to accommodate MySQL's partitioning rules.
We then sorted the input files by (FK1, customer_id) and split them into chunks of 10M records per file.
We created the schema with primary key (FK1, customer_id), partitioned on FK1.
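Roughly as follows - the partition type and count shown here are illustrative; the composite key is needed because MySQL requires every unique key (the primary key included) to contain all columns used in the partitioning expression:

CREATE TABLE customer (
  customer_id BIGINT NOT NULL,
  fk1 BIGINT NOT NULL,
  -- fk2..fk4 and the ~40 other columns as in the earlier sketch
  PRIMARY KEY (fk1, customer_id)
) ENGINE=InnoDB
PARTITION BY HASH (fk1)
PARTITIONS 32;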
Finally, we imported the 36 files in a single-threaded loop (the per-file load is sketched below the SET statements). We set the following connection properties:
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SESSION tx_isolation='READ-UNCOMMITTED';
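Each iteration of the loop then loaded the next chunk, roughly like this (the file name is a placeholder):

LOAD DATA LOCAL INFILE '/data/chunks/customer_013.txt'
INTO TABLE customer
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';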
Initially, each file completed in 10 minutes (i.e. a rate of 1M inserts per minute). Somewhere after about 100M records, though, the processing time per file started climbing steeply: the 13th file took 16 minutes, the 20th file 175 minutes, and the 21st around 280 minutes. We gave up at that point.
During the import the server metrics looked fairly normal - CPU utilization was low and network IO dropped. The interesting metric was that the number of reads kept increasing monotonically while the number of writes kept decreasing monotonically; both graphs looked like step functions moving in opposite directions.
Current State
We gave up trying to create the primary key and the partitioned table. Instead, we just created a non-unique index on customer_id.
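i.e. something along the lines of (the index name is a placeholder):

CREATE INDEX idx_customer_id ON customer (customer_id);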
Any clue as to what we can do to load the records with the primary key in place?