
Today I ran into one of the strangest things I have ever seen with MySQL. I have a trivial table:

CREATE TABLE `features`
(
  `feature_id` mediumint(6) unsigned NOT NULL AUTO_INCREMENT,
  `feature_name` varchar(100) CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL,
  PRIMARY KEY (`feature_id`),
  UNIQUE KEY `feature_name_key` (`feature_name`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci;

I am inserting data into it with Java and the mysql-connector-java-5.1.15 library. The data in feature_name may contain duplicates and I want to keep only unique values. I could use INSERT IGNORE, but that would also silently swallow other problems (e.g. data that is too long), so I do this instead:

pstmt = conn.prepareStatement( 
        "INSERT INTO features (feature_name) VALUES (?)" );

for ( String featureName: data4db.keySet() )
{
    pstmt.setString(1, featureName );

    try
    {
        pstmt.executeUpdate();
    }
    catch ( SQLException se )
    {
        if ( se.getErrorCode() == 1062 ) // duplicate entry
        {
            continue; // ignore
        }

        throw se; // do not ignore anything else
    }
}

After the data had been inserted into the db I noticed problems I had not expected at all. There are roughly 4000 records in the above table, which is fine. The problem is that some data could not be inserted due to a duplicate primary key, so I looked at what the auto-increment values in this table actually look like. It turns out that for most of the data the ids of adjacent rows are incremented by 1, as expected. But for reasons I do not know, feature_id sometimes jumps by 3, 5, 1000, 100000 - a seemingly random amount. As a result I have effectively 'run out of room' in this table: nothing more could be inserted once the id reached the maximum value of an unsigned mediumint (16777215).
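
To see where the gaps are, a query along these lines does the job (just a quick sketch I put together; nothing beyond the table definition above is assumed):

SELECT f.feature_id + 1 AS gap_starts_at
FROM features f
WHERE f.feature_id < ( SELECT MAX(feature_id) FROM features )
  AND NOT EXISTS ( SELECT 1 FROM features g
                   WHERE g.feature_id = f.feature_id + 1 );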

How can this happen? Has anyone encountered something similar? It is worth mentioning that only one program, with a single thread, was writing to this table. I have one more table that is almost identical - only the column widths and names differ - and it shows the same problem.

BTW - some more data:

mysql> show global variables like 'auto_inc%';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| auto_increment_increment | 1     |
| auto_increment_offset    | 1     |
+--------------------------+-------+
2 rows in set (0.01 sec)

mysql> show global variables like 'ver%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| version                 | 5.5.10                       |
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86                          |
| version_compile_os      | Win32                        |
+-------------------------+------------------------------+

Thank you for any hints in advance.

Artur

2 Answers


It's normal MySQL behavior. What happened is the following: you inserted data up to auto-increment key 3, then got a duplicate-key error because feature_name_key is defined as unique. The thing is, MySQL "wastes" the integer 4 and moves on to the next one; it does not reuse values whose write failed due to a key constraint.

If you had something like this:

PK | feature_name_key
 1 | key1
 2 | key2
 3 | key3
 4 | key1   <- fails due to the unique constraint; 4 will not be reused for the next successful insertion, hence the gaps in your primary key

then you lose integers available for the primary key / auto_increment. Rethink your insertion strategy, or how you construct the table that holds your data.
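
You can watch it happen from the mysql client. Something like the following, run against a fresh, empty copy of the table from the question, illustrates it (the exact ids assume InnoDB's default auto-increment lock mode):

mysql> INSERT INTO features (feature_name) VALUES ('key1'), ('key2'), ('key3');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> INSERT INTO features (feature_name) VALUES ('key1');
ERROR 1062 (23000): Duplicate entry 'key1' for key 'feature_name_key'

mysql> INSERT INTO features (feature_name) VALUES ('key4');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT feature_id, feature_name FROM features ORDER BY feature_id;
+------------+--------------+
| feature_id | feature_name |
+------------+--------------+
|          1 | key1         |
|          2 | key2         |
|          3 | key3         |
|          5 | key4         |
+------------+--------------+
4 rows in set (0.00 sec)

The failed INSERT consumed the value 4, so 'key4' lands on 5.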

Michael J.V.
  • You are right. Thank you for the swift and very useful lesson. I've double-checked it. So essentially the difference between the ids of two adjacent rows tells how many duplicate insertion attempts there were between them. – Artur Apr 19 '11 at 09:19
  • Can auto-incrementing on a failed duplicate be disabled somehow? Is there a workaround? I just want a table with an id and some unique string assigned to it, with no id gaps, but I don't want to use INSERT IGNORE (it has the same auto-increment problem) nor read the whole table before inserting to check which values are already there. A procedure? – Artur Apr 19 '11 at 09:28
  • There are many workarounds, but the question is - why do you want sequential IDs? You're using a surrogate key here when a natural key would be a much better choice from a theoretical point of view. You can make feature_name_key the primary key (so the unique constraint is enforced immediately) and create a trigger that updates a column called sequence_id; then you get integers assigned sequentially, without gaps, to your feature names (a sketch of such a trigger follows after these comments). – Michael J.V. Apr 19 '11 at 09:41
  • I'll try triggers. The reason I use a surrogate key is simple: rows from this table are referenced by another table that currently has over 32 million records, each referring to one of the rows in the features table. Thank you for the hints. – Artur Apr 19 '11 at 10:11
  • If you're using surrogate keys for some sort of foreign-key reference, why does it matter that they are in sequential order? Why not just widen the range to int or even bigint and let the db do its work, so you don't waste time? You might even compromise the integrity of your data. – Michael J.V. Apr 19 '11 at 10:23
  • True, they do not need to be in sequential order. 1) It just looks better if they are. 2) It lets me use a smaller int type for this column, which keeps the db faster and smaller. 3) In each data set I want to pump into the features table there are maybe 30 new unique values among the thousands I get each time, so after each such bulk insert I would skip thousands of values in the id column. I'll try first inserting all values (ignoring duplicates) and assigning unique ids later, before I reference them from other tables. That should work well. – Artur Apr 19 '11 at 10:51
  • Looking nicer is subjective; I doubt computers will complain about how something looks :) Also, do you really think the 1-byte difference between mediumint and int makes much difference in performance and storage consumption? On 32 million records it's less than 31 megabytes of data - not a huge impact considering what you lose if you tamper with the data. – Michael J.V. Apr 19 '11 at 10:55
  • On second thought (I am not a db expert at all - really far from it), I have to admit changing mediumint to int unsigned is a very good solution, because: 1) indeed, +32MB for data plus some for the index is not that bad; 2) a 32-bit uint is the native word size on most OSes, so operations on it are faster than on 3 bytes; 3) each 'bulk' insert into features may cause the id to jump by ~2000 in the worst case; 4) a quick calculation shows I would only 'overflow' a 32-bit uint within the next 36 years; 5) no changes to my software are needed - just a few ALTER TABLEs on the db. Thank you - I really appreciate you spent the time answering. – Artur Apr 19 '11 at 11:57
  • Can you please specify what the constraint is? My id jumped from 37 to 485995 and I don't know why. – Trisno Raynaldy Oct 28 '22 at 08:53
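
Here is the kind of trigger mentioned in the comments above - a sketch only, assuming a new nullable sequence_id column is added to the table. Reading the table a trigger is defined on is allowed in MySQL (only modifying it is not), but this pattern is not safe under concurrent writers; with the single-threaded loader from the question it should be fine:

ALTER TABLE features ADD COLUMN sequence_id mediumint(6) unsigned NULL;

-- hand out the next gap-free number on every insert;
-- COALESCE covers the empty-table case
CREATE TRIGGER features_assign_sequence_id
BEFORE INSERT ON features
FOR EACH ROW
SET NEW.sequence_id = ( SELECT COALESCE(MAX(sequence_id), 0) + 1
                        FROM features );

Because the row itself is rejected when the unique key fires, a failed duplicate insert leaves no trace in sequence_id - unlike the auto_increment counter, it stays gap-free.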

Between your insertions, have you deleted rows? MySQL might simply be remembering the auto-increment counter - deleting rows does not rewind it.
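
For example (a quick sketch on an empty copy of the table - InnoDB keeps the counter in memory and does not rewind it when rows are deleted, at least while the server stays up):

mysql> INSERT INTO features (feature_name) VALUES ('a'), ('b');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> DELETE FROM features WHERE feature_name = 'b';
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO features (feature_name) VALUES ('c');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT feature_id, feature_name FROM features;
+------------+--------------+
| feature_id | feature_name |
+------------+--------------+
|          1 | a            |
|          3 | c            |
+------------+--------------+
2 rows in set (0.00 sec)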

Vincent Mimoun-Prat