
My situation is:

  1. I have a number of client applications, each using a local DB (MS SQL or MS Access; sorry, this is an enterprise system, I have to support legacy...)
  2. I can't predict the trend among clients: now it's ~10, but it may be ~100 in a year.
  3. Data from those tables comes to my central server and is put into one common table.
  4. Sometimes existing (client) data is changed, so I have to perform update/delete operations.
  5. I don't want to use GUIDs (the .NET type System.Guid): they're hard to implement and support on MS Access, and besides, they're not good for performance.
  6. I need fast search on that common table, so it would be nice to use int or long int as the PK.

So, I want:

  1. Something unique to avoid collisions (it will be used as a PK)
  2. It should preferably be int or long int
  3. It must be assignable client-side before being inserted

My current solution is to take the CRC of a concatenation of:

  • Processor ID
  • BIOS date
  • User name (strings: hardware/user related data)
  • DateTime.Now (UTC)
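For illustration, here's a minimal sketch of that CRC scheme (in Python rather than .NET, with placeholder inputs standing in for the real hardware/user strings). Note that a 32-bit CRC reaches a ~50% chance of at least one collision after only ~77,000 keys (birthday bound), which is why several commenters warn against it:

```python
import zlib
from datetime import datetime, timezone

def make_crc_key(processor_id: str, bios_date: str, user_name: str) -> int:
    """Build a candidate key by CRC32-ing hardware/user data plus the
    current UTC timestamp, as the question describes. NOT collision-safe:
    a 32-bit hash hits ~50% collision probability after ~77k keys."""
    now = datetime.now(timezone.utc).isoformat()
    payload = "|".join([processor_id, bios_date, user_name, now])
    return zlib.crc32(payload.encode("utf-8"))  # unsigned 32-bit int
```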

Currently it works for me, but maybe there is a better approach to achieve my goals? Any comments, suggestions, examples, or experience of your own?

UPDATE: synchronization between client and server is a periodic action, so it can occur 2-3 times per day (it's a config variable)

Andriy Zakharko
  • Could you generate a GUID and then take the first or last x bits to make an int/long? – Tim S. May 29 '12 at 20:27
  • Yes, it's good - but only for one client application. I have a few with different locations/workstations, so this is not my case. – Andriy Zakharko May 29 '12 at 20:28
  • @TimS. For a DB primary key you need no collisions, not just a low % of collisions. I agree that identity makes the most sense for a PK. – Servy May 29 '12 at 20:28
  • 3
    @Tim S. - How would that serve as a unique id? – mbeckish May 29 '12 at 20:28
  • If the table is so large that you have a performance problem with GUIDs, then you most certainly will have a problem with collisions if you are using a 64-bit int and a CRC checksum algorithm for the PK. Have you determined that GUIDs really are a performance problem, or are you only guessing? – Magnus May 29 '12 at 20:30
  • @nos - they have to be assigned on the client side – Andriy Zakharko May 29 '12 at 20:32
  • Andriy, have you considered sequential GUIDs (as in NEWSEQUENTIALID())? The linked article also mentions that as a good clustering key option. I have had success with them in terms of insert/update/select performance. – Alan May 29 '12 at 20:33
  • Are these existing apps that you're aggregating, or something you're developing now? If you're developing (or updating), as is implied by changing the DB structure, there's absolutely no reason to use MS Access DBs when you can be using embedded SQLite/SQLCE/etc. – Avner Shahar-Kashtan May 29 '12 at 20:34
  • 3
    If data from multiple tables comes to one central table and you need to address changes to these records then my suggestion is to use two columns as PK of you central table. One column could be the Identity field from clients (not unique) and one column could be a client code (not unique) assigned from you to your client apps. The aggregate from ID and client code will be your PK – Steve May 29 '12 at 20:37
  • @Alan - I don't think that would work since you have multiple clients creating the IDs. – hatchet - done with SOverflow May 29 '12 at 20:38
  • @Avner Shahar-Kashtan - Unfortunately, I have to support Access, no way out. I may change the DB structure, I may start developing a new client (actually I'm doing it and using MS SQL Express), but.. – Andriy Zakharko May 29 '12 at 20:38
  • @Steve - I was just going to post that as an answer but decided not to since I don't know if data from one client is then moved out to the other clients. It would complicate doing the ClientId+ClientIdentity idea in that case. – hatchet - done with SOverflow May 29 '12 at 20:40
  • @hatchet: Granted, it increases the probability of collisions, but I still doubt that'd realistically become a problem with only 10-100 clients. – Alan May 29 '12 at 20:41
  • @hatchet right, but I think it improbable, and for such a special case I would prefer to work outside of the programming environment; but awaiting a clarification from Andriy – Steve May 29 '12 at 20:46
  • What about Guid.ToString()? It might not be great for search performance, but it might be good enough. – PhilB May 29 '12 at 21:08
  • Do the clients and the common server table have to have the same key at all times? Or is there enough uniqueness among your client-side columns? Do you need to publish updates/changes from one client to another? I ask because the simplest solution, under what you have currently described, seems to be to let the server assign the "master key value" when it gets new data, and just use the normal publish of "new" data to the clients. Updated rows going out will already have a key. Incoming rows that are in the master will already have a value. The only trick is on the source client... – StarPilot May 29 '12 at 21:32
  • The original source client may need to spot it already has that row (being the source), and therefore can just update its source row with the published "master db key". Otherwise, it can simply discard its processed row for the incoming master update. If the master key is being generated purely at collection/processing time by the master common table, you can just use an autoincrementing PK. Again, it depends on your requirements. – StarPilot May 29 '12 at 21:35
  • @StarPilot - clients don't know anything about each other, only about the server. The server acts as some kind of "brain", a mediator (GoF patterns). And there is also one point I have to describe (maybe update the question) - synchronization is a periodic action, so it can occur 2-3 times per day (it's a config variable) – Andriy Zakharko May 29 '12 at 21:38

4 Answers


If data from multiple tables comes to one central table and you need to address changes to these records, then my suggestion is to use two columns as the PK of your central table. One column could be the Identity field from the clients (not unique) and one column could be a client code (not unique) assigned by you to your client apps. The aggregate of ID and client code will be your PK.

This solution has the advantage of not requiring any changes in the client-side apps (perhaps some identity code to send to your central server, which you could use for some security measures). Of course, if the customer base grows (hopefully), you need to keep a centralized table of the code assigned to each client. Searching the central table should not be a problem, because you are using two numbers (or a short string for the identity code).
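A rough sketch of this two-column idea (Python stands in for the real DB layer; the client codes and rows are made up):

```python
# Composite PK: (client_code, local_id). Each client keeps its own identity
# column; the server assigns every client a unique client code, so rows
# from different clients can never collide in the central table.
central = {}  # {(client_code, local_id): row}

def upsert(client_code: int, local_id: int, row: dict) -> None:
    """Insert the row, or update it if this (client, id) pair already exists."""
    central[(client_code, local_id)] = row

upsert(1, 42, {"name": "from client 1"})
upsert(2, 42, {"name": "from client 2"})  # same local id, no collision
```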

Steve
  • +1 This is a good solution *as long as* the client id is also stored with every cross-referenced key. That is, you can never refer to a record by just its client-side key, or else that key will become useless when synced to the central DB. – dlras2 May 29 '12 at 21:23
  • @DanRasmussen Sorry, but I don't get it; the client is unaware of the PK used on the central table. It sends its update with its identity code and its primary key ID, and it is the app that manages the central table that should find the correct record and update (or create) it – Steve May 29 '12 at 21:27
  • Actually, you could just assume that every relational key only relates to other rows from the same client, and you would be safe. – dlras2 May 29 '12 at 21:29

You can always just add a PK column that is a number, and (depending on the DB you are using) set up a sequence or identity to handle it.

You could also create an index over the multiple columns you currently have to help speed up your searching.

Limey
  • This does not allow keys to be assigned client-side. (A requirement OP clarified in the comments.) – dlras2 May 29 '12 at 21:06

You could implement a key table.

Basically, the key table is just a table with one record: the next available integer key. When your program needs to generate a key, it increments the key in the key table. It has now reserved the previously available key. It can assign that key to whatever it likes and is assured that it will not conflict with any other key pulled in the same manner. This is how the LightSpeed ORM works by default. Its benefit over using the built-in identity column is that you can assign item ids before inserting them into the database, and can therefore define item relationships before inserting all items at once.

If you're worried about conflicts, you can lock/unlock the key table before reading and incrementing the next available key. If you know you need to insert more than one item at a time, you can increment the key by that much instead of by one multiple times. If you guess that sometime in the future your application will need a certain number of keys, you can reserve a block of sequential keys and keep track of them internally. This may waste keys if you don't assign them all, but it prevents excessive calls to the database.
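The block-reservation idea in the last paragraph (essentially a hi/lo generator) can be sketched like this; the in-process lock is a stand-in for the transaction a real key table would use, and the block size is an arbitrary choice:

```python
import threading

BLOCK = 100  # keys reserved per round trip (tune to taste)

class KeyTable:
    """Server-side 'next available key' record."""
    def __init__(self):
        self._next = 1
        self._lock = threading.Lock()  # stands in for a DB row lock

    def reserve(self, count: int = BLOCK) -> range:
        """Hand out a contiguous, never-reused block of keys."""
        with self._lock:
            start = self._next
            self._next += count
        return range(start, start + count)

class Client:
    """Hands out locally unique keys, refilling from the server as needed."""
    def __init__(self, server: KeyTable):
        self._server = server
        self._pool = iter(())  # empty until the first reservation

    def next_key(self) -> int:
        try:
            return next(self._pool)
        except StopIteration:  # pool exhausted: one round trip per BLOCK keys
            self._pool = iter(self._server.reserve())
            return next(self._pool)
```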

dlras2
  • 2
    This assumes that the clients are connected. They may not be when they create a row. – hatchet - done with SOverflow May 29 '12 at 20:41
  • Actually, this may be a solution, but no better than my current one. In your case inserts will be dramatically slow (I'm using WCF to connect the client applications and the server side, so it depends on that) – Andriy Zakharko May 29 '12 at 20:44
  • 1
    @hatchet The clients don't need to be always connected. They can reserve a block of keys to pull from, then they can create that many rows before needing to connect again. Even if they can't connect, they can use temporary ids until they can reserve the necessary ids from the table. – dlras2 May 29 '12 at 20:47
  • @Andriy.Zakharko Why will insert be dramatically slow? You can reserve a large block of keys to use, so you don't need any extra database calls when inserting values. It's also guaranteed not to conflict - something your current method can't actually do. – dlras2 May 29 '12 at 20:49
  • While workable this will end in a nightmare from a concurrency handling point of view. – Steve May 29 '12 at 20:50
  • @Steve If the clients are generating 10-100 records in a year (I could have terribly misunderstood point 2) one should worry much more about uniqueness over concurrency. – dlras2 May 29 '12 at 20:53
  • @DanRasmussen I think that ~10 ~100 numbers are the clients not the records – Steve May 29 '12 at 20:55
  • @Dan Rasmussen - according to your approach, the client side would not be able to insert a local record without asking the server for the next PK value, would it? If I got you right, the next available value will be provided by the server? In this case the client will fail in case of a temporary connection issue – Andriy Zakharko May 29 '12 at 20:58
  • @Andriy.Zakharko could you explain whether the solution proposed in the question comments by me and hatchet is workable? – Steve May 29 '12 at 21:00
  • @Andriy.Zakharko Your client, when it's connected to the database, could reserve `N` sequential key values. It can then disconnect from the database and use these keys, safe in the knowledge that they will not be used by any other client. If you need to generate record `N+1`, you will need to reconnect and reserve more keys. You can adjust your `N`, however, so that this occurs infrequently. Using an `N` of 1 would be, as you suggest, very inefficient. – dlras2 May 29 '12 at 21:05
  • @Steve - yes, your solution looks really the best, so please add it as an answer – Andriy Zakharko May 29 '12 at 21:06
  • @Dan Rasmussen - yes, it's possible, but let's face it - this is a complicated solution: I have to implement additional logic and keep track of it. It adds more places in the code (server and client) that can become a bottleneck. Anyway, thanks for the proposal – Andriy Zakharko May 29 '12 at 21:09

Just use an int64 key. On the client side, use negatively incrementing numbers away from 0 (starting at -1); on the server side, after the sync, use positively incrementing numbers. When you sync data from the client to the server, you then just return the new positive server-side numbers. I trust updating the key on a record is simple enough; if not, then just delete the old record and insert a new one with the values.

This is a really simple way to have client-side unique keys and updatable data without needing to worry about the problems caused by your solution, which at very best will just randomly clash depending on how large your CRC check is.

If you're dead set against using a GUID (MSDN for System.Guid.NewGuid() indicates the MAC address is part of the value, making them highly unique), then your answer is a composite key (and NOT one based on a CRC of a composite key!). Please do not fall into the mistake of thinking that your CRC solution is less likely to collide than a GUID: unless you have more entropy than a 128-bit GUID, you're only making more work for yourself.

As I point out, there's the whole negative half of the int64 space that you could use to identify unsynchronised, and hence temporary, unique ID numbers. You then even get the added potential benefit of a clustered index when writing to the central server.

So suppose your client side data keys look like this:

-5, -4, 76, 78, 79

that means -4 and -5 need to be inserted into the central location and then have their ids updated to the new values (likely 80 and 81 in this example).
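A sketch of that sync step (Python; the row contents and the starting server id of 80 are made up to match the example above):

```python
def sync(client_rows: dict, server_next_id: int):
    """Replace temporary negative client keys with permanent positive
    server keys; return the remapped rows and the new high-water mark."""
    synced = {}
    for key in sorted(client_rows):
        if key < 0:                      # unsynchronised temporary key
            synced[server_next_id] = client_rows[key]
            server_next_id += 1
        else:                            # already has a server key
            synced[key] = client_rows[key]
    return synced, server_next_id

rows = {-5: "e", -4: "d", 76: "a", 78: "b", 79: "c"}
rows, next_id = sync(rows, 80)  # -5 and -4 become 80 and 81
```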

If you want further reading on the uniqueness of GUIDs and the likelihood of a collision, I suggest reading Eric Lippert's recent blog posts on the topic. As for the Access side, you can just .ToString() them and convert them to a text key. But since I have provided a solution that works in int/int64 space, this would not be required.

Seph
  • Updating all those keys is going to be a mess... especially if other rows are using them as relational keys. – dlras2 May 29 '12 at 21:19
  • @DanRasmussen If you're using SQL Server, you just use `ON UPDATE CASCADE` and it will do all the work for you. If you're using an ORM, many support cascading to all defined relational keys. If neither of those, it's still not going to be that complex to update the foreign key relations manually. In any case, I think you significantly overestimate how much work it would be to update all the keys. – Seph May 30 '12 at 10:10
  • The key benefit of this approach is for when you have more than one table you want to sync, if you have 10 tables, each with a key you want to sync then SQL will cascade the new IDs for you and you just do one update per key and forget about it. – Seph May 30 '12 at 10:10