Zero downtime deploys: A tale of Django migrations

At Teem, we aim for zero down-time deploys; so, one of the most important things we must validate is that things will not break mid-deploy!

The most sensitive step of the deploy process is the changes to our database. Prior to the automation I am about to describe, validation of the database migrations required specialized knowledge about Postgres, the changes to the application model, load on the database for that model, and a bit of general experience. This obviously slows down reviews and subsequently deploys. Worse, it was simply too easy to miss problem migrations when depending on only peer reviews. To make our lives easier we created a series of validation checks to ensure that each database migration will be backwards compatible.

The what

The checks I am going to describe are simply a sequence of regex that we run on the migrations in the changelog. The process looks, roughly, like this:

Using git, generate a list of new migrations in this release,
Using Django’s sqlmigrate manage command, generate the SQL for each migration,
Run a sequence of regex on each SQL command,
Report the issues,
Profit!

We do this in a python script we internally call octoeb which uses Click to create a commandline interface. So, I can get a changelog along with an audit of the migrations in my current branch using:

$ ocotoeb changelog

I won’t describe the specific python code, instead I will give you equivalent bash commands that you can run in your CLI and a simple description of the regex that we are using. This will give you all the pieces you need to build a similar script in your favorite language.

The why

The basic goal is to ensure that any applied migrations are backwards compatible with the model definitions in the currently deployed release. This is a requirement because our current deployment process looks like:

pull the new release code to a single server,
run the migrations,
restart the application,
check the application status,
slowly roll the code to the rest of the servers.

As a result, during a deploy we have a mix of old model definitions and new model definitions running simultaneously. This means that the database must except both the old and the new for a short period of time and that any change we make to the database should not lock up an entire table.

Probably the most common change we often want to make is simply adding a new column to an existing model. This can present several issues. First, your new column should not set a default. In postgres, adding a column with a default will lock the table while it rewrites the existing rows with the default. This can easily be avoided by adding the column first without the default, then adding the default, followed by a future backfill on the existing rows. This will create the column and all future rows will have the default.

Implicit in the above recommendation is that all new columns must be nullable. You can not add a column without a default unless you allow null. Additionally, while the old models are running against the new table definitions, the app will set a value for that column, so it must either have a default or allow null otherwise Postgres will throw an error.

Finally, the other change that we need to watch for is removing columns. This is a multi-step process. If you drop a column while the old models are still active you will get two possible errors (1) when Django tries to select on that column that no longer exists (which it will because it always explicitly names the columns selected) or (2) attempting to insert data to a column that doesn’t exist anymore. To actually handle this type of model change you must deploy the model change prior to running the migration. In our process, that means you must commit the model change in a release separate from the database migration.

There are certainly other cases to consider, but we have found these 3 cases to cover the vast majority of our migration concerns. Having put these checks into place, we rarely have any issues with database migrations during deploy.

The how

Getting your list of migrations

To find the new migrations you can run the following command

$ git log --name-status master.. | grep -e  "^[MA].*migrations.*"

Breaking this down, git log --name-status master.. will print a log of the commits and the file changes in each commit between master and the current HEAD. The grep returns only those lines that start with A or M and also contains the work migrations. These are all of the new or modified migration files. It will return something like this

A	apps/accounts/migrations/0019_auto_20170126_1830.py

Getting your SQL

Once you have the list of migration files that you need to check, we need to get the actual SQL that is going to be run by Django. Fortunately, Django provides a simple command to get this SQL, sqlmigrate. For example,

django-admin sqlmigrate account 0002

will print the sql for the second migration in the accounts app. In the pervious section the ouput contains all of the information that we need. Specifically, with a command like this

$ git log --name-status master.. | grep -e  "^[AM].*migrations.*" | cut -d / -f 2,4 | cut -d . -f 1

we would get back exactly the list of apps and the migration name for each migration that we need to check

accounts/0017_auto_20170126_1342

This still isn’t quite what we need. At the end of the day the following command will generate the SQL for you

$ git log --name-status master.. | grep -e  "^[AM].*migrations.*" | cut -d / -f 2,4 | cut -d . -f 1 | awk -F"/" '{ print $1,$2}' | xargs -t -L 1 django-admin sqlmigrate

We are using python for our scripting, so my script is actually a bit different, using the regex apps/((.*)/migrations/(\d+[0-9a-z_]*))\.py and a combination of a for loop and subprocess to generate the SQL.

Regex magic

Now that we have the actual SQL that needs to be tested, it is simply a matter of running a few regex tests. We have 3 core tests that we run:

check_for_not_null which we test using /NOT NULL/
check_for_dropped_columns which we test using /DROP COLUMN/, and
check_for_add_with_default which we test using /ADD COLUMN .* DEFAULT/.

For each migration, we test those 3 regex and alert if they have any matches. As I mentioned earlier, there are certainly other cases that could be considered. Let me know if there are some additional checks I should add. Since we have implemented these checkes, I can’t remember the last time we had a migration issue during a deploy so they seem to cover most of the use cases we run into.