Currently I'm testing out TokuDB and I'm very impressed. At the moment the inserts per second have peaked at just over 50,000 per second with two simultaneous jobs running. The average insert rate is between 38,000 and 42,000 inserts per second.

I would like to go even higher, to 100,000 inserts per second, as I will need to insert 1.2 billion calculated rows for now and about 6 billion more in the near future. I would like some advice on how to achieve this :-)

My current setup:

  1. Hardware: VPS with 4GB of RAM, 150GB SSD, 2 cores: Intel Westmere E56xx/L56xx/X56xx (Nehalem-C) 2.59GHz CPU
  2. Disk mount options: defaults,noatime
  3. OS: CentOS 6.8 64bit
  4. Database: Percona Server 5.7.14-8

My.cnf settings:

# TokuDB #
tokudb_cache_size = 2G
tokudb_commit_sync = 0
tokudb_fsync_log_period = 1000

TokuDB table layout:

CREATE TABLE `t1` (
  `id` int(15) NOT NULL AUTO_INCREMENT,
  `m_id` int(11) NOT NULL,
  `c1` decimal(6,2) DEFAULT NULL,
  `c2` decimal(6,2) DEFAULT NULL,
  `c3` decimal(6,2) DEFAULT NULL,
  `c4` decimal(6,2) DEFAULT NULL,
  `c5` decimal(6,2) DEFAULT NULL,
  `c6` decimal(6,2) DEFAULT NULL,
  `c7` decimal(6,2) DEFAULT NULL,
  `factor` decimal(4,2) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1

CREATE TABLE `t2` (
  `id` int(15) NOT NULL AUTO_INCREMENT,
  `v_id` int(15) NOT NULL,
  `pid` int(11) DEFAULT NULL,
  `amount` decimal(6,2) DEFAULT NULL,
  `unit` int(1) DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1

I'm aware that I'm not using any indexes other than the primary key. This is because of the negative impact secondary keys would have on insert speed. A clustering key for each table will be created at the end of the insert job.
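
For reference, the clustering keys I plan to add afterwards would look roughly like this (the index names and columns here are just examples, assuming TokuDB's CLUSTERING KEY syntax):

ALTER TABLE t1 ADD CLUSTERING KEY idx_t1_m_id (m_id);
ALTER TABLE t2 ADD CLUSTERING KEY idx_t2_v_id (v_id);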

Additional setting, executed on each MySQL connection:

SET unique_checks=OFF;

Somehow I'm not able to get this into my.cnf. If someone knows how, that would be greatly appreciated (currently unique_checks = off blocks MySQL from starting due to an unknown variable in my.cnf).
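
A possible workaround might be init_connect, which makes the server run a statement for each new connection (note that it is not applied to connections with the SUPER privilege):

# possible workaround: run for each new non-SUPER connection
init_connect = 'SET unique_checks=OFF'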

The SQL statements are grouped into batches of 15,000. A PHP script generates the SQL statements and sends each batch via mysqli's multi_query() to the MySQL server:

<?php
    // Build up batches of roughly 15,000 statements and send each batch
    // to MySQL in one round trip via multi_query().
    foreach (generateCombinations($Arr) as $c) {

            // Parent row for t1; id is AUTO_INCREMENT, so NULL is supplied.
            $QueryBatch[] = "insert into t1 values (NULL"
                            . ", " . $record->id
                            . ", " . rand(1, 35)
                            . ", " . rand(1, 140)
                            . ", " . rand(1, 20)
                            . ", NULL"
                            . ", " . rand(1, 14)
                            . ", " . rand(1, 300)
                            . ", " . rand(1, 4)
                            . ", NULL );";
            // Capture the generated t1 id for the related t2 rows.
            $QueryBatch[] = "SET @t1id = LAST_INSERT_ID();";

            $cntBatch++;

            // Collect the child rows for t2 into a single multi-row INSERT.
            $pquery = array();
            foreach ( $c as $key => $pid){

                    if ( is_null($pid) )
                            continue;

                    $pquery[] = "(NULL, @t1id, " . $pid . ", " . rand(1, 800) . ", 0)";

                    $cntBatch++;
            }

            $QueryBatch[] = "insert into t2 values " . implode(',', $pquery) . ";";

            // Flush the batch once it has grown past 15,000 statements.
            if ($cntBatch > 15000) {

                    $query = implode($QueryBatch);

                    if ( $mysqli->multi_query($query) ){
                            // Drain the result sets of all statements in the batch.
                            while ($mysqli->next_result()) {;}
                    } else {
                            printf("Errormessage: %s\n", $mysqli->error);
                            echo $query . "\n";
                    }

                    $cntBatch = 0;
                    unset($QueryBatch);
            }

    }
?>

Example of SQL insert statement:

insert into t1 values (NULL, 1 , 30, 100, 15, NULL, 10, 250, 2, NULL );
SET @t1id = LAST_INSERT_ID();
insert into t2 values (NULL, @t1id, 1, 750, 0),(NULL, @t1id, 1, 600, 0),(NULL, @t1id, 1, 500, 0),(NULL, @t1id, 1, 400, 0),(NULL, @t1id, 1, 300, 0),(NULL, @t1id, 1, 200, 0),(NULL, @t1id, 1, 100, 0);
insert into t1 values (NULL, 2 , 25, 95, 14, NULL, 11, 200, 3, NULL );
SET @t1id = LAST_INSERT_ID();
insert into t2 values (NULL, @t1id, 1, 600, 0),(NULL, @t1id, 1, 100, 0),(NULL, @t1id, 1, 300, 0),(NULL, @t1id, 1, 443, 0),(NULL, @t1id, 1, 521, 0),(NULL, @t1id, 1, 213, 0),(NULL, @t1id, 1, 433, 0);
[.. At least 14982 more..]
Robbert

1 Answer

If it were me, I would cut down the number of statements being executed, and cut down the number of commits. I'm assuming that autocommit is enabled, given that we don't see any START TRANSACTION or COMMIT statements.

That's a whole bloatload of individual INSERT and SET statements. At least the inserts into the child table use the multi-row insert form, rather than a separate INSERT statement for each row.

If I needed this to be fast, I would

  1. generate the id values for the t1 table, and include those in the INSERT statement
  2. do away with the call to LAST_INSERT_ID()
  3. use multi-row insert for t1 (rather than separate INSERT statement for each row)
  4. use START TRANSACTION and COMMIT
  5. run a single process to perform inserts into t1 (serialize) to avoid potential contention for locks

If it were for InnoDB, I'd also do SET FOREIGN_KEY_CHECKS=0.

There's already a boatload of calls to the rand function in the code, so incrementing an integer id for t1 isn't going to move the needle. When we start, we'd just need one query to get the current AUTO_INCREMENT value, or the MAX(id), whichever...
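
For seeding those generated ids, a quick query at startup would do, roughly along either of these lines (using the t1 table from the question):

SELECT AUTO_INCREMENT
  FROM information_schema.TABLES
 WHERE TABLE_SCHEMA = DATABASE()
   AND TABLE_NAME = 't1';

-- or simply
SELECT COALESCE(MAX(id), 0) + 1 AS next_id FROM t1;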

Basically, I'd cut down the number of statements being executed, and get more work done by each statement, and more work done before each COMMIT.

Inserting ten (10) t1 rows per statement would significantly reduce the number of statements that need to be executed.

START TRANSACTION;
-- insert ten rows into t1
INSERT INTO t1 (id,m_id,c1,c2,c3,c4,c5,c6,c7,factor) VALUES
 (444055501, 1 , 30, 100, 15, NULL, 10, 250, 2, NULL )
,(444055502, 2 , 25, 95, 14, NULL, 11, 200, 3, NULL )
, ...
,(444055510, 10 , 7, 45, 12, NULL, 10, 300, 4, NULL )
;
-- batch together the t2 rows associated with the ten t1 rows we just inserted
INSERT INTO t2 VALUES
-- 444055501  
 (NULL, 444055501, 1, 750, 0)
,(NULL, 444055501, 1, 600, 0)
,(NULL, 444055501, 1, 500, 0)
,(NULL, 444055501, 1, 400, 0)
,(NULL, 444055501, 1, 300, 0)
,(NULL, 444055501, 1, 200, 0)
,(NULL, 444055501, 1, 100, 0)
-- 444055502  
,(NULL, 444055502, 1, 600, 0)
,(NULL, 444055502, 1, 100, 0)
,(NULL, 444055502, 1, 300, 0)
,(NULL, 444055502, 1, 443, 0)
,(NULL, 444055502, 1, 521, 0)
,(NULL, 444055502, 1, 213, 0)
,(NULL, 444055502, 1, 433, 0)
-- 444055503
, ...
;

-- another ten rows into t1
INSERT INTO t1 (id,m_id,c1,c2,c3,c4,c5,c6,c7,factor) VALUES
 (444055511, 11 , 27, 94, 15, NULL, 10, 250, 11, NULL )
,(444055512, 12 , 24, 93, 14, NULL, 11, 200, 12, NULL )
, ...
,(444055520, 10 , 7, 45, 12, NULL, 10, 300, 4, NULL )
;
INSERT INTO t2 VALUES
 (NULL, 444055511, 1, 820, 0)
,(NULL, 444055511, 1, 480, 0)
, ...
;

-- repeat INSERTs into t1 and t2, and after 1000 loops
-- i.e. 10,000 t1 rows, do a commit
COMMIT;
START TRANSACTION;
INSERT INTO t1 ...

LOAD DATA INFILE

Any discussion of performance of inserts would be incomplete without at least mentioning LOAD DATA INFILE.

For raw insert performance, that can't be beat. But since we don't have the data in a file, we don't have the key values (needed for the foreign key in t2), and we've got all those calls to rand generating the data, LOAD DATA INFILE doesn't seem to be a good fit.
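
For completeness, the general shape would be something like this (the file path is hypothetical, and it assumes the id values were pre-generated into the file):

LOAD DATA INFILE '/tmp/t1.csv'
INTO TABLE t1
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(id, m_id, c1, c2, c3, c4, c5, c6, c7, factor);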

spencer7593

  • Dear Spencer7593, thank you for your detailed answer! I understand what you are saying and agree, but the thing is: I need to know the ID of each t1 row for the corresponding records in t2 to relate to. The ID field in t1 is an auto increment. You would suggest removing the auto increment and creating the ID within the script? This could be done, as basically only one job will create the records. – Robbert Oct 07 '16 at 23:07
  • A value can be provided for an AUTO_INCREMENT column, without removing or disabling AUTO_INCREMENT. We get AUTO_INCREMENT behavior by supplying a NULL value. I'm suggesting that instead of supplying a NULL value, we generate the `id` value and supply that value in the inserts to both `t1` and `t2`. That's what I would do to speed it up... cut down the number of SQL statements and the number of COMMITs. – spencer7593 Oct 07 '16 at 23:13
  • With InnoDB, we configure auto_increment behavior to get *consecutive* auto_increment values. So with a multi-row insert, we know the first row is LAST_INSERT_ID()+0, the second row inserted is LAST_INSERT_ID()+1, up to LAST_INSERT_ID()+ROW_COUNT(). If I had to use AUTO_INCREMENT values, I'd batch the inserts to `t1` (multi-row insert), `SELECT LAST_INSERT_ID() INTO @lid`, and then my inserts to `t2` would use `@lid+0` for rows related to the first t1 row inserted, `@lid+1` for rows related to the second t1 row inserted. (My caution is that I'm not sure AUTO_INCREMENT behavior with TokuDB is the same as InnoDB.) – spencer7593 Oct 07 '16 at 23:22
  • I've acted upon your suggestions and currently the insert speed is between 120,000 and 145,000 records per second. However, since I've just upgraded my VPS to 4 cores and 8GB of RAM, I also increased tokudb_cache_size to 5GB. I would like to know if we could go even higher (200,000+/s) :-) – Robbert Oct 07 '16 at 23:51
  • Some HW specs: CPU utilization is at a total of 25%, disk IOPS are at 40 and can boost to 1000 IOPS. – Robbert Oct 07 '16 at 23:54
  • If tokudb allows you to disable foreign key checks (like InnoDB does), you might be able to run *two* different database connections to do the inserts simultaneously... one connection blasting at the `t1` table, the other blasting at the same time to the `t2` table. Test with binary logging disabled (sql_log_bin). I strongly suspect that one of your big bottlenecks was the individual statements, all the overhead to do the parsing, perform the syntax check, perform the semantic check, prepare an execution plan, and execute the plan, obtain locks, generate rollback, write to the log... – spencer7593 Oct 08 '16 at 02:20