3

I'm in the process of writing a Membership Provider for use with our existing membership base. I use EF4.1 for all of my database access and one of the issued that I'm running into is when the DB was originally setup the relationships were done programmatically instead of in the db. One if the relationships needs to be made on a column that isn't required for all of our users, but in order to make the relationships does need to be unique (from my understanding).

My solution that I believe will work is to do an MD5 hash on the userid field (which is unique ...which would/should guarantee a unique value in that field). The part that I'm having issues with on sql server is the query that would do this WITHOUT replacing the existing values stored in the employeeNum field (the one in question).

So in a nutshell my question is. What is the best way to get a unique value in the employeeNum field (possibly based on an md5 hash of the userid field) on all the rows in which a value isn't already present. Also, to a minor/major extent...does this sound like a good plan?

Jared
  • 5,840
  • 5
  • 49
  • 83
  • 1
    Not to comment on the problem at hand, but in general don't use MD5. You should use some flavor of SHA IMHO. http://stackoverflow.com/questions/2768913/if-md5-is-broken-what-is-a-better-solution ... you can also read about MD5 collision problems here: http://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities – Aaron Bertrand Jun 08 '12 at 16:10
  • @AaronBertrand I'm aware of the differences between them/possible/collision/etc I listed MD5 due to the size of the hash, but at the end of the day I'm not really concerned about the hash used, only that a unique value is ALWAYS generated. A hash value should accomplish this when based on a userid (which is unique). We have a SMALL (90 ish) membership base, but some of them are vendors...which is what caused me to not switch to the asp membership provider. So in short I'm aware of the smaller collision domain for MD5, but I'm not all together worried about it much. – Jared Jun 08 '12 at 16:32
  • 3
    So if you're aware of the collision possibility with MD5, and are still willing to use it to generate "unique" values (which may not end up being unique), can you better describe what you mean by "best"? Also can you explain why you need a hashed value of the userid instead of just the userid? – Aaron Bertrand Jun 08 '12 at 16:38
  • @AaronBertrand the possibility of a collision with 250ish records (assuming our membership basis swells) and a hash being generated in < 50% of them AT MOST (hash only used for vendors) is quite remote. Like I said though I'm not picky on WHICH hash is used. I listed MD5 due to the smaller hash size and because the hash isn't being used for security, but instead used for it's uniqueness. – Jared Jun 08 '12 at 16:56
  • If I understand you correctly, you need a unique value in `employeeNum` because you want to use that column as a foreign key in another table. Move all UserID values to employeeNum column (except the ones already have valaues) and create a unique constraint on employeeNum. After that you can create foreign key constraints on `employeeNum`. – Mikael Eriksson Jun 08 '12 at 16:57
  • @AaronBertrand To answer they why the hashed instead of the userid. This is mainly because we ftp documents to our server and then those match the employee#'s that use the site. While the possibility of of a userid just happening to also be an employee# is remote with a hash I don't have to worry about it – Jared Jun 08 '12 at 17:01
  • @MikaelEriksson This would most likely work, but I don't want to chance a userid incorrectly matching an employee#. Since generating a hash vs transposing data is not very difficult (I don't believe?) I'd just assume use a hash of some form to keep that possibility at practically nothing. – Jared Jun 08 '12 at 17:04
  • What is an employeeNum? What datatype and what meaning? – Cade Roux Jun 08 '12 at 17:09
  • 1
    Any chance you can start over with the right data model? EF shouldn't be pointed at a nullable column for uniqueness. Why can't it use userid for the relationship instead of the optional employeeNum? – Aaron Bertrand Jun 08 '12 at 17:26
  • @AaronBertrand we have a mainframe that is 100% unaware of the db/website. We simply have it copy certain files into a directory with ftp (paystubs/etc). Since it doesn't know about the db it consequently doesn't know about the userid... Only the employee number. – Jared Jun 08 '12 at 18:55
  • So what is wrong with saying `UPDATE table SET EmployeeNum = 1000000 + UserID WHERE EmployeeNum IS NULL`? If EmployeeNum will stay below 1000000 then you've guaranteed no collisions and you've avoided hashing altogether. – Aaron Bertrand Jun 08 '12 at 19:05
  • @AaronBertrand employeeNum isn't limited to numbers it can also be letters+numbers. I simply stated MD5 because in MySQL it's as easy as MD5('value') and your done... Adding an arbitrary value to the userid should be fine. At the moment the field is simply a `varchar(50)` but could be expanded if the size needs to be larger. With this in mind I guess that might make answering the question of how to update the `null` or `""` fields without touching the fields with values easier. – Jared Jun 08 '12 at 19:28
  • What people don't realize about md5 is that broken colission resistance is actually meaningless most of the time. He is storing the plaintext as well as the digest in the same table. Collision resistance is irrelevant. – iPherian Apr 26 '17 at 21:29

3 Answers3

16

If your question is just how to generate a hash value for userid, you can do it this way using a computed column (or generate this value as part of the insert process). It isn't clear to me whether you know about the HASHBYTES function or what other criteria you're looking at when you say "best."

DECLARE @foo TABLE
(
  userid INT, 
  hash1 AS HASHBYTES('MD5',  CONVERT(VARCHAR(12), userid)),
  hash2 AS HASHBYTES('SHA1', CONVERT(VARCHAR(12), userid))
);

INSERT @foo(userid) SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 500;

SELECT userid, hash1, hash2 FROM @foo;

Results:

userid  hash1                               hash2
------  ----------------------------------  ------------------------------------------
1       0xC4CA4238A0B923820DCC509A6F75849B  0x356A192B7913B04C54574D18C28D46E6395428AB
2       0xC81E728D9D4C2F636F067F89CC14862C  0xDA4B9237BACCCDF19C0760CAB7AEC4A8359010B0
500     0xCEE631121C2EC9232F3A2F028AD5C89B  0xF83A383C0FA81F295D057F8F5ED0BA4610947817

In SQL Server 2012, I highly recommend at least SHA2_256 instead of either of the above. (You forgot to mention what version you're using - always useful information.)

All that said, I still want to call attention to the point I made in the comments: the "best" solution here is to fix the model. If employeeNum is optional, EF shouldn't be made to think it is required or unique, and it shouldn't be used in relationships if it is not, in fact, some kind of identifier. Why would a user care about collisions between employeeNum and userid if you're using the right attribute for the relationship in the first place?

EDIT as requested by OP

So what is wrong with saying UPDATE table SET EmployeeNum = 1000000 + UserID WHERE EmployeeNum IS NULL? If EmployeeNum will stay below 1000000 then you've guaranteed no collisions and you've avoided hashing altogether.

You could generate similar padding if employeeNum might contain a string, but again is it EF that promotes these horrible column names? Why would a column with a Num suffix contain anything but a number?

Aaron Bertrand
  • 272,866
  • 37
  • 466
  • 490
  • We are currently running SQL Server 2005. As to my knowledge it's pretty much limited to the Management Studio. My actual SQL knowledge is in MySQL and learning TSQL to a functional level is just another thing on my long list to learn/task to complete/etc. I say 'best' because I don't know what is best. If this had been MySQL I would most likely not even asked the question, but I just don't know enough about SQL Server outside of the Management Studio yet. I think the important thing to remember is the hash isn't for security! – Jared Jun 08 '12 at 19:00
  • Also whatever changes I make today will also have a modification to their Stored Procedures so that all new entries will contain the same changes. The application that uses this particular relation won't even be used yet, but I need to get the relations defined in the db so that I can write the membership provider. I'll most likely be modify the db to map a user to a role instead of an 'application' as I re-write the applications into MVC3 instead of asp.net pages. – Jared Jun 08 '12 at 19:20
  • 1
    @AaronBetrand it will only contain a letter if it's a specific employee "type" but it's still considered their employee number which is why we named the field as we did. While the literal meaning of the column name would not make sense to most. It does make sense to anyone inside the company which is why it was left the way it was. It's a name that they have used on the mainframe for 20+ yrs and probably won't be changed anytime soon :S – Jared Jun 15 '12 at 21:48
2

You could also use a uniqueidentifier setting the default value to (newid())

Create a new column EmployeeNum as uniqueidentifer, then:

UPDATE Employees SET EmployeeNum = newid()

Then set as primary key.

Gordon Bell
  • 13,337
  • 3
  • 45
  • 64
  • 1
    Or if it needs to be a positive INT, `ABS(CHECKSUM(NEWID())` ... but again that doesn't avoid collisions with existing EmployeeNum values. – Aaron Bertrand Jun 08 '12 at 18:06
1
UPDATE EMPLOYEE
    SET EMPLOYEENUM = HASHBYTES('SHA1', CAST(USERID AS VARCHAR(20)))
WHERE EMPLOYEENUM IS NULL
Russell Fox
  • 5,273
  • 1
  • 24
  • 28
  • If the EmployeeNum column is an INT, just be aware that while this produces rather large integer values when implicitly converted, there is nothing that prevents collisions with any existing employeeNum values. – Aaron Bertrand Jun 08 '12 at 17:49
  • 1
    Does it even need to be a hash? If you only have 250 records can you just update employeenum to equal userID, or 1000 + userID? – Russell Fox Jun 08 '12 at 17:50
  • I agree @Russell, I'm not sure I understand why the hash needs to be there either. – Aaron Bertrand Jun 08 '12 at 17:51
  • @AaronBertrand I've explained my reasoning in the comments of the original post. – Jared Jun 08 '12 at 18:52
  • @Jared if I still think the hashed values are not necessary, do you think you've convinced me? – Aaron Bertrand Jun 08 '12 at 18:55
  • @AaronBertrand I guess I don't understand that last statement. – Jared Jun 08 '12 at 20:14
  • @Jared I'll rephrase it another way: you haven't convinced me that hashed values are necessary. – Aaron Bertrand Jun 08 '12 at 20:28
  • @AaronBertrand I haven't determined that they were either. It's just SUPER easy to md5 hash in MySQL which is what I learned SQL on. I actually believe that your arbitrary # added to the userid (you posted the comment on my question) will work...I'll just have to use it as a string since it's a `varchar` field. To be extra clear I initially said MD5 due the ease of implementation on MySQL. I also feel that to many people (not saying anyone particular) get stuck on a hash being used for security which isn't the case here at all. I simply needed something unique! – Jared Jun 08 '12 at 20:57
  • @AaronBertrand if you will repost the comment that you made on the original question (the one suggesting adding '1000000' to uid) I'll select that as the answer as I see no issues with doing that besides needing to add it as a string instead of an integer value. – Jared Jun 08 '12 at 20:59