11

I have an existing model that looks somewhat like the following...

class Resource(models.Model):

    id = models.AutoField(primary_key=True)

We have been using this for some time, and now have ~1M instances of these Resource objects (and associated ForeignKey/else usages) in our database.

I now have a need to track another ID on this model, one that I want to enforce is unique.

other_id = models.IntegerField(unique=True)

This other_id information is currently stored in some external CSVs, and I want to (at some point in the process) load this information in to all existing Resource instances.

After adding the above field, Django's makemigrations works just fine. However when I go to apply said migration against an existing database I get an error indicating that I need to provide a default to use for all existing Resource instances. I'm sure many of you have seen something similar.

What is the best approach to getting around this limitation? Some methods I have thought of...

    • Remove the unique=True requirement
    • apply the migration
    • externally load in the other_id value to all existing models (through some management command, or 1-off script)
    • add the unique=True back in and apply the migration
    • Dump all existing data to JSON
    • flush all tables
    • apply the migration (with unique=True)
    • write a script that loads the data back in, adding the correct other_id value
  1. (unsure if this is possible) - Write some custom migration logic to automatically reference these external CSVs to load other_id values when I run manage.py migrate. This could hit issues if (at some point in the future) someone re-runs these migrations and this part fails (cannot find existing resource id in the CSVs to pull out other_id).

All of these feel complicated, but then again I guess what I am trying to do isn't the simplest thing either.

Any ideas? I have to imagine someone has had to work around a similar issue in the past.

Thanks!

wakey
  • 2,283
  • 4
  • 31
  • 58

2 Answers2

14

Actually, the source or your issue is not the unique constraint by itself but the fact that your field doesn't allow nulls and has no default value - you'd have the very same error with a non-unique field.

The proper solution here is to allow the field to be null (null=True) and default it to None (which will translate to sql "null"). Since null values are excluded from unique constraints (at least if your db vendor respects SQL standard), this allow you to apply the schema change while still making sure you cannot have a duplicate for non-null values.

Then you may want a data migration to load the known "other_id" values, and eventually a third schema migration to disallow null values for this field - if and only if you know you have filled this field for all records.

bruno desthuilliers
  • 75,974
  • 6
  • 88
  • 118
  • Thanks for the thoughts, and for pointing out the actual cause of the "must provide a default". Seems pretty similar to my #2, except I did not know data migrations were a thing! – wakey Jul 08 '19 at 12:46
  • This is very different from your second solution - you don't have to export data, flush tables, then edit and reimport data - all things that can take some time on a large database and are most definitly error prone. Actually this is closer to your #1, but is much safer as it makes sure you won't have non-null duplicate values (which would break the second migration adding back the unique constraint). – bruno desthuilliers Jul 08 '19 at 12:52
  • did I say #2? I meant #1, except I had unique and null mixed up :) Thanks again – wakey Jul 08 '19 at 12:59
9

Django has something called Data Migrations where you create a migration file that modifies/remove/add data to your database as you apply your migrations.

In this case you would create 3 different migrations:

  1. Create a migration that allow null values with null=True.
  2. Create a data migration that populate the data.
  3. Create a migration that disallow null values by removing the null=True added in step 1.

As you then run python manage.py migrate it would apply all of the migrations in step 1-3 in the correct order.

Your data migration would look something like this:

from django.db import migrations

def populate_reference(apps, schema_editor):
    MyModel = apps.get_model('yourappname', 'MyModel')
    for obj in MyModel.objects.all():
        obj.other_id = random_id_generator()
        obj.save()

class Migration(migrations.Migration):

    dependencies = [
        ('yourappname', '0001_initial'),
    ]

    operations = [
        migrations.RunPython(populate_reference),
    ]

You can create an empty migration file using the ./manage.py makemigrations --empty yourappname command.

Marcus Lind
  • 10,374
  • 7
  • 58
  • 112
  • Thanks for the answer! My only concern with using a DataMigration is that I would need to rely on some external CSVs to do the indexing (`resource.other_id = external_data[resource.id]`). Our processes create new `Resource` instances all the time, so I can't just copy the 'current state' of the CSV into my django app and expect it work for any migration ever in the future. – wakey Jul 08 '19 at 12:59
  • 1
    You can [squash your migrations](https://docs.djangoproject.com/en/2.2/topics/migrations/#squashing-migrations) when you are done and get rid of these steps and just create the final state of the model for future migrations. – Marcus Lind Jul 08 '19 at 13:04
  • @wKavey if you can't reliably have the `other_id` value immediatly available when a new `Resource` is created then you need to keep the field nullable (and with the None default value), as well as setting up some automated way of updating resources when "other_id" becomes available (he proper solution depending on the context of course). – bruno desthuilliers Jul 08 '19 at 14:50
  • @brunodesthuilliers thanks for following me down here. Currently the values for `other_id` are maintained in a CSV file mapping `id` to `other_id`. Whenever a new resource is created in the database, a new entry is added in this CSV. If I'm writing a data migration that utilizes this CSV to populate `other_id`, its fine if I'm sure that all resources and their `id`s are represented. My issue is that I need to check this migration into version control, presumably along with the CSV. There is no guarantee that if someone else runs this migration in the future that all IDs are still represented. – wakey Jul 08 '19 at 19:28