39

I've been importing large CSV files of data, usually fewer than 100,000 records, with PHP and MySQL (InnoDB tables). I need to use PHP to transform some fields and do some text processing before the MySQL INSERTs (this happens in process_note_data() in the code below). MySQL's LOAD DATA is not feasible here, so please do not suggest it.

I recently tried to improve the speed of this process by wrapping the inserts in a MySQL transaction using START TRANSACTION and COMMIT. The performance increase was surprising: processing time dropped by a factor of 20, so a 20-minute import now takes about 1 minute.

QUESTIONS.

1.) Does anyone understand why there was such a performance increase (20 minutes down to 1 minute)?

2.) Should I be concerned about how big the transaction may get with 100,000 records?

3.) Should I be concerned with a large number of inserts and/or updates in the transaction?

/*
 * Customer Notes Data:
 * Rows are either a meeting, call or note!
 */
$row = 1;
$data = array();
$fields = array();
$line = '';

$db->query('SET autocommit=0;');
$db->query('START TRANSACTION;');

if (($handle = fopen("modules/".$currentModule."/Data/customernote.csv", "r")) !== FALSE) {
  while (($data = fgetcsv($handle, 4096, ',', '"')) !== FALSE && $row < 999000) {
    //Row 1 - CSV header row with field names
    if ($row == 1) {
      $csv_fields = $data;
    } elseif ($row > 1) {
      $fields = $this->process_note_data($data, $csv_fields, $row);
    }
    $row++;
  } // end while
  fclose($handle);
}

$db->query('COMMIT;');
$db->query('SET autocommit=1;');

Note: The text/field processing is done in the call to $this->process_note_data(), which then calls another helper class that contains the INSERT statement code. I didn't have enough room to include all of the code. $db->query() is a typical database-object method for running MySQL queries.
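For illustration only, here is a minimal sketch of what a prepared-statement INSERT helper could look like; PDO, the NoteWriter class, and the customer_notes table/columns are assumptions, not the actual code behind process_note_data():

// Hypothetical only -- not the asker's actual helper class. PDO, NoteWriter,
// and the customer_notes table/columns are all assumptions.
class NoteWriter
{
    private $stmt;

    public function __construct(PDO $pdo)
    {
        // Prepare the INSERT once and reuse it for every CSV row.
        $this->stmt = $pdo->prepare(
            'INSERT INTO customer_notes (contact_id, note_type, note_text)
             VALUES (:contact_id, :note_type, :note_text)'
        );
    }

    public function insert_note(array $fields)
    {
        // Executes inside whatever transaction the caller has open, so the
        // InnoDB log flush is deferred until the caller issues COMMIT.
        $this->stmt->execute($fields);
    }
}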

  • I don't see any update/insert here, maybe removing these resulted in faster execution ;) – dev-null-dweller Feb 03 '13 at 17:35
  • The text/field processing is done in the call to $this->process_note_data(), which then calls another helper class that has the INSERTs. I didn't have enough room to include all of the code. Note, it is properly inserting all records. – jjwdesign Feb 03 '13 at 17:38
  • transaction ~= `in memory processing`; commit ~= `flush to storage` – जलजनक Feb 03 '13 at 17:44
  • You don't need to (and shouldn't) `SET autocommit=0;` and back again; from [here](http://dev.mysql.com/doc/refman/5.0/en/commit.html): *"With START TRANSACTION, autocommit remains disabled until you end the transaction with COMMIT or ROLLBACK."* – BlueRaja - Danny Pflughoeft Jun 11 '13 at 20:11

2 Answers

27
  1. Please check this link:

    https://dev.mysql.com/doc/refman/5.5/en/optimizing-innodb-transaction-management.html

    From that page: "InnoDB must flush the log to disk at each transaction commit if that transaction made modifications to the database. When each change is followed by a commit (as with the default autocommit setting), the I/O throughput of the storage device puts a cap on the number of potential operations per second." (A short sketch contrasting autocommit with a single explicit transaction follows this list.)

  2. Big transactions may affect performance at commit time (see the quote above).

  3. Only in the case of a rollback; however, that can be optimized with some settings (see the link above).
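To make the quoted point concrete, here is a rough PHP/PDO sketch contrasting the two commit patterns; $pdo, $rows, and insert_row() are placeholders, not code from the question or from this answer:

// Pattern A: default autocommit -- every INSERT is its own transaction,
// so InnoDB flushes its log to disk once per row; disk I/O caps throughput.
foreach ($rows as $row) {
    insert_row($pdo, $row);   // implicit COMMIT after each statement
}

// Pattern B: one explicit transaction -- the log flush happens once,
// at the final COMMIT, which is what turns a ~20-minute import into a ~1-minute one.
$pdo->beginTransaction();
foreach ($rows as $row) {
    insert_row($pdo, $row);   // no durable flush yet
}
$pdo->commit();               // single flush at the end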

– MiGro
  • Would you recommend COMMIT'ing every 1,000 or so INSERTs in order to reduce the processing at COMMIT, or am I worrying about this a bit too much? – jjwdesign Feb 03 '13 at 18:27
  • 1k should be ok but it depends on H/W. I would suggest to run some tests here. Anyway - be careful about data consistency (i.e. - you loaded 20k out of 100k records and the system crashed). – MiGro Feb 03 '13 at 18:32
  • There isn't much processing to be done at `COMMIT` time, so there's no reason to periodically commit for that. However if the system crashes while a very large transaction is in progress, it can require a large amount of time to rollback (and this will be done during server start while not taking requests). With just 100k rows, however you should probably not worry about that. – jeremycole Feb 21 '13 at 01:44
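To make the chunked-commit idea from the comments above concrete, here is a minimal sketch that commits every 1,000 rows; $pdo, $rows, and insert_row() are placeholders, and the batch size is simply the figure discussed above. As noted, a crash mid-import leaves only the already-committed batches in the table:

$batchSize = 1000;   // assumed batch size; worth testing against your own hardware
$count = 0;

$pdo->beginTransaction();
foreach ($rows as $row) {
    insert_row($pdo, $row);
    if (++$count % $batchSize === 0) {
        // Make this batch durable and start a fresh transaction, so a crash
        // mid-import only has to roll back the current batch.
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();   // commit the final, partial batch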
23

My own little test in .NET (4 fields per record):

| Scenario | Time |
| --- | --- |
| INSERT 1 record, no transaction | 60 ms |
| INSERT 1 record, using a transaction | 158 ms |
| INSERT 200 records using transactions, commit after each record | 17,778 ms |
| INSERT 200 records, no transactions | 4,940 ms |
| INSERT 200 records using a transaction, commit only after the last record | 4,552 ms |
| INSERT 1,000 records using a transaction, commit only after the last record | 21,795 ms |

Client in Denmark, server in Belgium (Google cloud f1-micro).

I meant to put this in a comment, but the formatting there isn't good... so here is my apology in advance ;-)

– MrCalvin