12

I have a Django web application that uses the default auto-incremented positive integers as the primary key. This key is used throughout the application and is frequently inserted into the URL. I don't want to expose this number to the public so that they can guess the number of users or other entities in my Database.

This is a frequent requirement and I have seen questions to similar mine with answers. Most solutions recommend hashing the original primary key value. However, none of those answers fit my need exactly. These are my requirements:

  1. I would like to keep the Primary Key field type as Integer.
  2. I also would prefer not to have to hash/unhash this value every time it is read or written or compared to the database. That seems wastefuly It would be nice to do it just once: When the record is initially inserted into the Database
  3. The hashing/encryption function need not be reversible since I don't need to recover the original sequential key. The hashed value just needs to be unique.
  4. The hashed value needs to be unique ONLY for that table -- not universally unique.
  5. The hashed value should be as short as possible. I would like to avoid extremely long (20+ characters) URLs

What is the best way to do achieve this? Would the following work?

def hash_function(int):
    return fancy-hash-function # What function should I use??


def obfuscate_pk(sender, instance, created, **kwargs):
    if created:
        logger.info("MyClass #%s, created with created=%s: %s" % (instance.pk, created, instance))
        instance.pk = hash_function(instance.pk)
        instance.save()
        logger.info("\tNew Pk=%s" % instance.pk)

class MyClass(models.Model):
    blahblah = models.CharField(max_length=50, null=False, blank=False,)


post_save.connect(obfuscate_pk, sender=MyClass)
e4c5
  • 52,766
  • 11
  • 101
  • 134
Saqib Ali
  • 11,931
  • 41
  • 133
  • 272

4 Answers4

23

The Idea

I would recommend to you the same approach that is used by Instagram. Their requirements seems to closely follow yours.

Generated IDs should be sortable by time (so a list of photo IDs, for example, could be sorted without fetching more information about the photos) IDs should ideally be 64 bits (for smaller indexes, and better storage in systems like Redis) The system should introduce as few new ‘moving parts’ as possible—a large part of how we’ve been able to scale Instagram with very few engineers is by choosing simple, easy-to-understand solutions that we trust.

They came up with a system that has 41 bits based on the timestamp, 13 o the database shard and 10 for an auto increment portion. Sincce you don't appear to be using shards. You can just have 41 bits for a time based copmonent and 23 bits chosen at random. That does produce an extremely unlikely 1 in 8.3 million chance of getting a conflict if you insert records at the same time. But in practice you are never likely to hit this. Right so how about some code:

Generating IDs

START_TIME = a constant that represents a unix timestamp

def make_id():
    '''
    inspired by http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
        '''
    
    t = int(time.time()*1000) - START_TIME
    u = random.SystemRandom().getrandbits(23)
    id = (t << 23 ) | u
    
    return id


def reverse_id(id):
    t  = id >> 23
    return t + START_TIME 

Note, START_TIME in the above code is some arbitary starting time. You can use time.time()*1000 , get the value and set that as START_TIME

Notice that the reverse_id method I have posted allows you to find out at which time the record was created. If you need to keep track of that information you can do so without having to add another field for it! So your primary key is actually saving your storage rather than increasing it!

The Model

Now this is what your model would look like.

class MyClass(models.Model):
   id = models.BigIntegerField(default = fields.make_id, primary_key=True)  

If you make changes to your database outside django you would need to create the equivalent of make_id as an sql function

As a foot note. This is somewhat like the approach used by Mongodb to generate it's _ID for each object.

dfrankow
  • 20,191
  • 41
  • 152
  • 214
e4c5
  • 52,766
  • 11
  • 101
  • 134
  • Even though its not mathematically perfect, (small possibility of collision) If its good enough for Instagram, it's good enough for me! Thanks very much. I will award you the bounty after I implement this solution (probably this upcoming weekend) – Saqib Ali Jun 03 '16 at 07:13
  • 1
    oops, should have mentioned that START_TIME should be an int. eg 1464972048475 – e4c5 Jun 03 '16 at 16:59
  • Ok. Thanks `make_id()` is now working. But when I attempt to create an object it fails: https://gist.github.com/saqib-zmi/4c3fad6d922ea0437c297065f71ba2d3 – Saqib Ali Jun 04 '16 at 01:20
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/113785/discussion-between-saqib-ali-and-e4c5). – Saqib Ali Jun 04 '16 at 01:46
  • `id = models.BigIntegerField(default=make_id(), primary_key=True)` FYI, My database is MariaDB if that makes any difference. (However, Django doesn't know that. It's a drop-in-replacement for mysql) – Saqib Ali Jun 04 '16 at 02:40
  • I figured out my problem. I had forgotten to do a database migrations. Whoops. Thanks!! – Saqib Ali Jun 04 '16 at 05:12
  • Glad it worked out. Sorry couldn't reply to your commment in time. – e4c5 Jun 04 '16 at 08:03
  • Follow-up question: http://stackoverflow.com/questions/37627600/django-how-to-migrate-primary-key-to-bigintegerfield-of-class-that-inherits-fro – Saqib Ali Jun 05 '16 at 00:08
  • FYI sqlite can't store 64-bit integers, so if you try with that DB (e.g., for testing), you'll get `OverflowError: Python int too large to convert to SQLite INTEGER` – dfrankow Feb 25 '21 at 17:32
  • I believe Postgres also doesn't store unsigned bigint. https://www.postgresql.org/docs/13/datatype-numeric.html is signed. So, I am a bit fuzzy on whether this solution works on Postgres if the highest bit is set. I can imagine a signed solution, but I'm not sure if this code is it. – dfrankow Feb 26 '21 at 17:11
  • I'm pretty sure this answer has a bug (using the highest bit) on Postgres, which I detail in https://stackoverflow.com/q/66391565. – dfrankow Feb 26 '21 at 20:45
13

You need to separate two concerns:

  1. The primary key, currently an auto-incrementing integer, is the best choice for a simple, relatively predictable unique identifier that can be enforced on the database level.

  2. That does not mean you have to expose it to users in your URLs.

I'd recommend adding a new UUID field to your model, and remapping your views to use it, instead of the PK, for object lookups.

Jack Shedd
  • 3,501
  • 20
  • 27
-1

A really simple solution is simply encrypting the ID before sending it out to an external source. You can decrypt it on the way back in.

Robert Christopher
  • 461
  • 1
  • 6
  • 15
-4

Keep the AUTO_INCREMENT, but pass it around in a semi-secret way: In a cookie. It takes a bit of coding to establish the cookie, set it, and read it. But cookies are hidden from all but serious hackers.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • 4
    No. Cookies can be examined by anyone who clicks a few buttons on their browser. Relying on the technical unsaviness of users is not a good way to implement security. – Saqib Ali Jun 03 '16 at 06:55
  • @SaqibAli - True, you can see _your_ cookies. But someone on another computer should not be able to see them without first hacking into your computer. And, cookies are visible (in your browser) _only_ to the same domain that issued them. That is, another web site cannot snoop on your cookies through your browser. (Yes, there is XSS, but most site protect against that.) – Rick James Jan 13 '21 at 16:11