
When I import data into a new table from a large Excel file and one record fails, nothing is imported. I think that's fine because it meets the Atomicity rule. However, when I fix the source data error and import again, the identity column does not start from 1 but from a large value.

For example

create table #test (id int identity(1,1), name varchar(4) default '')

-- these two inserts fail: the values are longer than varchar(4)
insert into #test (name) values('1 insert will fail');
select ident_current('#test') as ident_current
insert into #test (name) values('2 insert will fail');
select ident_current('#test') as ident_current

-- this insert succeeds ('3 OK' fits into varchar(4))
insert into #test (name) values('3 OK');
select ident_current('#test') as ident_current

select * from #test

drop table #test

Result

id          name 
----------- ---- 
3           3 OK

Wikipedia describes ACID as follows:

Atomicity

Atomicity requires that each transaction is "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes.

So it looks like SQL Server does not leave the database state (the identity value) unchanged when an insert fails. Does this break the ACID rule?

By the way, PostgreSQL doesn't let the identity (serial) value grow when an insert fails. (Update: only sometimes, see the comments. Do NOT rely on this.)

test=# create table AutoIncrementTest (id serial not null, name varchar(4));
NOTICE:  CREATE TABLE will create implicit sequence "autoincrementtest_id_seq" for serial column "autoincrementtest.id"
CREATE TABLE
test=# insert into autoincrementtest(name) values('12345');
ERROR:  value too long for type character varying(4)
test=# insert into autoincrementtest(name) values('12345');
ERROR:  value too long for type character varying(4)
test=# insert into autoincrementtest(name) values('1234');
INSERT 0 1
test=# select * from autoincrementtest;
 id | name
----+------
  1 | 1234
Craig Ringer
LiuYan 刘研
  • hmm what happens in PostgreSQL when two inserts happen concurrently and the first one gets rolled back? – Conrad Frix Jul 19 '12 at 16:18
  • SEQUENCEs in PostgreSQL are not bound by Atomicity rules, either. It only happens here because the new value fails validation *before* the INSERT needs to fetch a sequence value. If you also had a UNIQUE constraint on the name column, then tried to insert the same name twice (second insert fails), then insert a different name, you should end up with two records, but with a "gap" in between the ID numbers. This is done for performance reasons. If you MUST have "gapless" sequences you WILL pay a performance penalty, as each INSERT will need to get an exclusive lock on the sequence counter. – Matthew Wood Jul 19 '12 at 16:30
  • @MatthewWood this is similar for SEQUENCEs in SQL Server 2012. – Aaron Bertrand Jul 19 '12 at 16:55
  • @MatthewWood, thanks for the explanation about PostgreSQL Sequence. I did a test, and the gap did exist. `ALTER TABLE AutoIncrementTest ADD CONSTRAINT UQ_Name UNIQUE (name); insert into autoincrementtest(name) values('123'); insert into autoincrementtest(name) values('123'); insert into autoincrementtest(name) values('1233');`, gives the following result: `id=2,name=123; id=4,name=1233` – LiuYan 刘研 Jul 19 '12 at 16:59
  • Also everyone should be aware of [this bug in SQL Server 2012](http://connect.microsoft.com/SQLServer/feedback/details/739013/alwayson-failover-results-in-reseed-of-identity), which can cause unexpected gaps. And no, the bug is not *that there are gaps*, since IDENTITY does not guarantee to be free of gaps, but that the gaps are larger than expected. – Aaron Bertrand Jul 19 '12 at 17:01
  • Just to clarify @MatthewWood 's comments, you can't use a `SEQUENCE` if you need gapless sequences in PostgreSQL, because some failures only occur at commit time (e.g. serialization failures). You must use an ordinary one-row table where you `SELECT ... FOR UPDATE` or `UPDATE ... RETURNING` the value (see the sketch below). This will block other transactions from proceeding past that point until the one with the lock commits or rolls back, so make sure transactions relying on gapless sequences are really short. If at all possible, fix your design so it doesn't need gapless sequences. – Craig Ringer Jul 20 '12 at 14:08
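
As a rough illustration of the one-row-table approach Craig Ringer describes in the comment above, here is a PostgreSQL sketch; the table and column names are invented for this example:

create table invoice_counter (last_value bigint not null);
insert into invoice_counter values (0);

-- inside the transaction that needs the next gapless number:
begin;
-- the UPDATE takes a row lock, so concurrent transactions queue up here
-- until this transaction commits or rolls back
update invoice_counter
   set last_value = last_value + 1
 returning last_value;
-- ... use the returned value for the insert, then:
commit;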

3 Answers


Since the identity value is not something that's physically stored in any part of the database that you can access, I disagree that this breaks atomicity. If you don't want to "break atomicity", or if you care about gaps (you shouldn't), there are other ways to do this (e.g. use a serializable transaction and take MAX(col)+1 for the new row).
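
For illustration, a minimal T-SQL sketch of that MAX(col)+1 alternative; the dbo.Orders table and its columns are invented for this example:

-- serializable isolation plus the UPDLOCK/HOLDLOCK hints keep two sessions
-- from computing the same "next" id, at the cost of serializing the inserts
set transaction isolation level serializable;
begin transaction;

declare @NextId int;

select @NextId = coalesce(max(OrderID), 0) + 1
from dbo.Orders with (updlock, holdlock);

insert into dbo.Orders (OrderID, OrderDate)
values (@NextId, getdate());

commit transaction;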

Aaron Bertrand
  • You can access the value using DBCC CHECKIDENT, and it must obviously be physically stored in the server somewhere: http://msdn.microsoft.com/en-us/library/ms176057.aspx – Jodrell Jul 19 '12 at 15:46
  • @Jodrell yes, it is stored, but please let me know where you can find it without using system functions. I can also execute xp_cmdshell and xp_reg scripts from within the database, does that make the file system and registry part of atomicity too? – Aaron Bertrand Jul 19 '12 at 15:49
  • I don't believe identity columns are part of standard SQL so that would be challenging. Either way, I agree that you shouldn't rely on contiguous values and therefore the atomicity is moot. – Jodrell Jul 19 '12 at 15:56
  • @Jodrell In other words, the identity value that is generated is not part of *your data* until it has been successfully committed to your table. An `IDENTITY` value is a meaningless, surrogate key. If you generate a `NEWID()` value and do nothing with it, do you expect it to be written somewhere so that it is the next one used by some other transaction, or would generating a different `NEWID()` value also break atomicity? What about trying to roll back a call to `sp_send_dbmail` or `NET SEND`? You should consider the generation of `IDENTITY` values to be external, for all intents & purposes. – Aaron Bertrand Jul 19 '12 at 15:56
  • I agree with you, but from a purist view all your examples would break atomicity. I don't suggest that should change. It's better that MS SQL is extended in this way without the burden of some distributed transaction system. – Jodrell Jul 19 '12 at 16:06
  • Maybe I'm some kind of paranoiac, but the gaps make me feel that the data is dirty. :S – LiuYan 刘研 Jul 19 '12 at 16:28
  • @LiuYan刘研 then STOP using IDENTITY columns. Either use a much more expensive and less concurrent method as I suggested, or use a wider key (e.g. GUID) where gaps are irrelevant, or reconsider why you think they're dirty. Do the actual values really matter? Why? What meaning are you placing on the values? – Aaron Bertrand Jul 19 '12 at 16:30
  • Well, I often treat the identity value as the row number of a record (like a revision number in a version control system), but if gaps (especially large gaps) occur, I can't tell how many records have been inserted when I look at the last row. Also, gaps limit the total number of records: I can't insert 65535 records if the identity column type is `smallint` and there's a 10000-value gap in the column, unless I reuse the values in the gap. – LiuYan 刘研 Jul 19 '12 at 17:18
  • @LiuYan刘研 then like I said, stop using IDENTITY for this. – Aaron Bertrand Jul 19 '12 at 17:19
  • `MAX(col)+1` can be used programmatically, but when using the Import & Export tools I can't control the column value to be `MAX(col)+1`. So it's a bit of a pity not to use IDENTITY. :( – LiuYan 刘研 Jul 19 '12 at 17:57
  • @LiuYan刘研 since SQL Server doesn't work exactly as you want it to, you need to make a choice, sorry. – Aaron Bertrand Jul 19 '12 at 17:57
  • @Jodrell but that wouldn't satisfy the requirement. That shows a sequence without gaps, but does not indicate when a row goes missing. – Aaron Bertrand Jul 20 '12 at 11:22

Yes it does, so don't rely on contiguous values with MS SQL Server.

I would suggest that relying on contiguous identity values per se, with any engine, is a brittle and naive approach. Gaps can always appear as a result of subsequent deletes.
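
A quick T-SQL sketch of that point, just for illustration: gaps appear even when every insert succeeds.

create table #gap_demo (id int identity(1,1), name varchar(10));
insert into #gap_demo (name) values ('a'), ('b'), ('c');
delete from #gap_demo where id = 2;
select id, name from #gap_demo;   -- rows with id 1 and 3 remain; this gap has nothing to do with failed inserts
drop table #gap_demo;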

I suppose this deviation from purist ACID compliance allows a performance optimization in MS SQL Server.

Jodrell
  • Not sure whether the deviation improves performance, but I think SQL Server must have some reason for doing it this way. – LiuYan 刘研 Jul 19 '12 at 16:15
  • @LiuYan刘研 the reason is precisely for performance. SQL Server doles out the next value and doesn't care whether it ultimately gets used. This way the next user doesn't have to wait to see if the previous value was used or not before determining what the "next next" value should be. In a highly concurrent system, this is going to be a lot more important than gaps. – Aaron Bertrand Jul 19 '12 at 16:56

Atomicity guarantees, according to this formulation, that the database state is left unchanged. The question is what we mean by the database state.

As long as you understand the SQL concept of "identity insert" to neither claim nor guarantee that identity columns will be sequential, there is no issue. It does require rethinking what SQL guarantees when you consider identity inserts, but since we know that this can fail in the case mentioned, the value was never really guaranteed to be the NEXT value.

Before the insert, the 'next' value of the identity column is only guaranteed to be greater than the current value, not to be exactly the current value plus one. That is still the state afterwards.

David Manheim