
Reposted on Server Fault.

Questions

Question 1: As the size of the database table gets larger, how can I tune MySQL to increase the speed of the LOAD DATA INFILE call?

Question 2: Would using a cluster of machines to load different CSV files improve the performance or kill it? (This is my benchmarking task for tomorrow, using LOAD DATA and bulk inserts.)

Goal

We are trying out different combinations of feature detectors and clustering parameters for image search; as a result we need to be able to build big databases in a timely fashion.

Machine Info

The machine has 256 GB of RAM, and there are another two machines with the same amount of RAM available if there is a way to improve the creation time by distributing the database.

Table Schema

The table schema looks like this:

+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| match_index   | int(10) unsigned | NO   | PRI | NULL    |                |
| cluster_index | int(10) unsigned | NO   | PRI | NULL    |                |
| id            | int(11)          | NO   | PRI | NULL    | auto_increment |
| tfidf         | float            | NO   |     | 0       |                |
+---------------+------------------+------+-----+---------+----------------+
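
For reference, the DESCRIBE output above corresponds to roughly the following definition. This is a reconstruction, and the MyISAM engine is an assumption: it is not stated in the question, but it is consistent with the key_buffer_size tuning mentioned later and with an auto_increment column inside a multi-column primary key, which MyISAM permits.

CREATE TABLE test (
    match_index   INT UNSIGNED NOT NULL,          -- int(10) unsigned
    cluster_index INT UNSIGNED NOT NULL,          -- int(10) unsigned
    id            INT NOT NULL AUTO_INCREMENT,    -- int(11)
    tfidf         FLOAT NOT NULL DEFAULT 0,
    PRIMARY KEY (match_index, cluster_index, id)
) ENGINE=MyISAM;                                  -- engine is an assumption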

Benchmarking so far

The first step was to compare bulk inserts against loading from a CSV file into an empty table.

It took 0:09:12.394571 to do 4,000 inserts with 5,000 rows per insert
It took 0:03:11.368320 to load 20,000,000 rows from a CSV file
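
For reference, a sketch of the two approaches being compared. The values and the three-row batch are purely illustrative (the benchmark used 5,000 rows per INSERT), and the file path is the one used in the call shown further down.

-- Bulk insert: one INSERT carrying many value tuples (illustrative values;
-- the id column is omitted because it is auto_increment).
INSERT INTO test (match_index, cluster_index, tfidf) VALUES
    (1, 17, 0.021),
    (1, 42, 0.003),
    (2,  9, 0.150);

-- Loading from a file: a single LOAD DATA INFILE call (same form as below).
LOAD DATA INFILE '/mnt/tests/data.csv' INTO TABLE test;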

Given the difference in performance, I went with loading the data from a CSV file. First I loaded files containing 100K, 1M, 20M, and 200M rows using the call below.

LOAD DATA INFILE '/mnt/tests/data.csv' INTO TABLE test;

I killed the 200M row load (~3 GB CSV file) after 2 hours.
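
One thing worth noting about the call above: without a FIELDS clause, LOAD DATA INFILE expects tab-separated fields and newline-terminated rows, so if the file is genuinely comma-separated the terminators have to be spelled out. A minimal sketch, assuming a comma-delimited file (the delimiters are an assumption, not taken from the question):

LOAD DATA INFILE '/mnt/tests/data.csv'
INTO TABLE test
FIELDS TERMINATED BY ','     -- only needed if the file really is comma-separated
LINES TERMINATED BY '\n';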

So I ran a script to create the table, insert different numbers of rows from a CSV file, and then drop the table; see the graph below.

[Graph: time to load different numbers of rows from the CSV file into an empty table]

It took about 7 seconds to insert 1M rows from the CSV file. Next I decided to benchmark loading 1M rows at a time to see if there was going to be a bottleneck at a particular database size. Once the database hit approximately 59M rows, the average insert rate dropped to approximately 5,000 rows/second.

[Graph: insert rate as the table grows, loading 1M rows at a time]
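
Expressed in SQL, that incremental benchmark boils down to repeated loads into the same growing table, timing each one. The chunk file names here are hypothetical; the actual script and paths are not shown in the question.

-- Each statement loads another 1M-row chunk into the same table and is
-- timed separately; 'chunk_*.csv' names are hypothetical.
LOAD DATA INFILE '/mnt/tests/chunk_001.csv' INTO TABLE test;
LOAD DATA INFILE '/mnt/tests/chunk_002.csv' INTO TABLE test;
-- ...repeated until the table reaches the target size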

Setting the global key_buffer_size to 4294967296 (4 GB) improved the speeds slightly when loading the smaller files. The graph below shows the speeds for different numbers of rows:

[Graph: load speeds for different numbers of rows with key_buffer_size set to 4 GB]
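
For reference, a minimal sketch of applying that setting at runtime. The 4294967296 value (4 GB) is the one quoted above; key_buffer_size only affects MyISAM index blocks, so this assumes a MyISAM table.

-- Raise the MyISAM key buffer to 4 GB (value taken from the text above).
SET GLOBAL key_buffer_size = 4294967296;

-- Check that the new value took effect.
SHOW VARIABLES LIKE 'key_buffer_size';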

However, for inserting 1M rows into the already-populated table it didn't improve the performance:

rows: 1,000,000 time: 0:04:13.761428 inserts/sec: 3,940

vs. for an empty database:

rows: 1,000,000 time: 0:00:06.339295 inserts/sec: 315,492

Comments

I think that it may be worth asking this at http://Serverfault.com, since SO is programming related and you seem to look for hardcore admin experience. +1 for really doing your homework so far though, excellent question. – Michael Stum Oct 02 '11 at 09:57

0 Answers