How do we build a fast API against database models with foreign keys and many-to-many relationships? If you do nothing, you get what I call the “waterfall of doom”. At some point in the past someone told me, or I read, that “joins are effectively free in Postgres”. While this might be somewhat true when you are writing all of the SQL and can control every part of your query, I have recently found that when the database gets big enough and you are using the Django ORM, joins aren’t free and less can be more!
Warning, I am not a DBA and mileage may vary.
At Teem we deal with a lot of calendar data. For various reasons this means we have database tables that look something like this:
+----------------------+ +----------------------+
|Calendar | |Event |
|======== | |===== |
|- id | |- id |
|- source | |- organization_id |
|- organization_id | |- calendar_id |
| | |- organizer_id |
+----------------------+ |- title |
|- start_at |
|- end_at |
+----------------------+
+----------------------+ +----------------------+
|Participant | |Organizer |
|=========== | |========= |
|- id | |- id |
|- organization_id | |- organization_id |
|- event_id | |- event_id |
|- email | |- name |
|- name | |- email |
+----------------------+ +----------------------+
(Note: this is not how it is actually set up in production.)
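To make the ORM snippets below easier to follow, here is a rough Django models sketch of this schema. The field names mirror the diagram, but the field types, the Organization model, and details like related_name are my assumptions, not the real Teem models.

from django.db import models


class Organization(models.Model):
    # Not in the diagram, but every table carries an organization_id.
    name = models.CharField(max_length=255)


class Calendar(models.Model):
    source = models.CharField(max_length=255)
    organization = models.ForeignKey(Organization, on_delete=models.CASCADE)


class Event(models.Model):
    organization = models.ForeignKey(Organization, on_delete=models.CASCADE)
    calendar = models.ForeignKey(Calendar, on_delete=models.CASCADE)
    # related_name avoids a reverse-name clash with Organizer.event below;
    # both foreign keys exist because the diagram has them, odd as that looks.
    organizer = models.ForeignKey('Organizer', on_delete=models.CASCADE,
                                  related_name='organized_events')
    title = models.CharField(max_length=255)
    start_at = models.DateTimeField()
    end_at = models.DateTimeField()


class Participant(models.Model):
    organization = models.ForeignKey(Organization, on_delete=models.CASCADE)
    event = models.ForeignKey(Event, on_delete=models.CASCADE)
    email = models.EmailField()
    name = models.CharField(max_length=255)


class Organizer(models.Model):
    organization = models.ForeignKey(Organization, on_delete=models.CASCADE)
    # related_name is an assumption; it avoids a clash with Event.organizer.
    event = models.ForeignKey(Event, on_delete=models.CASCADE,
                              related_name='event_organizers')
    name = models.CharField(max_length=255)
    email = models.EmailField()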
Now, you can be happy or upset with the design for various reasons, but my biggest issue is how Django queries these tables for our API. The simplest, most direct way to serialize an Event, including its calendar, organizer, and participants, causes what I call the “waterfall of doom”, a.k.a. the N+1 problem. This topic is covered in many places throughout the internet, so I won’t focus on it here; but basically, every request for an Event results in 4 more requests (the waterfall) for the Organization, Calendar, Organizer, and Participants respectively. Even worse, if I request a list of Events, the simplest code for the API will do 4 database queries for each Event in the list (the waterfall of doom). For a single event this isn’t noticeable, but if I want to serialize all the events in a calendar, it is a big performance problem.
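To see the waterfall in concrete terms, here is a hypothetical snippet (not our code) that triggers it; every attribute access on a relation that has not already been loaded issues its own query.

from django.db import connection, reset_queries

reset_queries()  # requires DEBUG = True for connection.queries to populate

for event in Event.objects.filter(calendar_id=5051):
    event.organization                     # +1 query per event
    event.calendar                         # +1 query per event
    event.organizer                        # +1 query per event
    list(event.participant_set.all())      # +1 query per event

# 1 query for the list of events, plus 4 more for every event in it.
print(len(connection.queries))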
We use Django Rest Framework, so the simple (and bad) Event serialization described above looks something like this:
from rest_framework.serializers import ModelSerializer
from rest_framework.viewsets import ReadOnlyModelViewSet

# (The Event model and the other serializers are defined elsewhere in the app.)


class EventSerializer(ModelSerializer):
    organizer = OrganizerSerializer(read_only=True)
    calendar = CalendarSerializer(read_only=True)
    organization = OrganizationSerializer(read_only=True)
    participants = ParticipantSerializer(
        source='participant_set', many=True, read_only=True)

    class Meta:
        model = Event
        fields = (
            'id', 'start_at', 'end_at', 'title',
            'organizer', 'calendar', 'organization', 'participants',
        )


class EventViewSet(ReadOnlyModelViewSet):
    queryset = Event.objects.all()
    serializer_class = EventSerializer
It is worth noting that this is clearly not how our production API is currently designed, but it demonstrates the core performance issue. Fortunately, Django provides a couple of tools out of the box to help us solve this performance issue: prefetch_related and select_related.
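As a quick refresher in my own words: select_related follows forward foreign keys with a SQL JOIN inside the one query, while prefetch_related runs a separate query per relation and stitches the results together in Python, which is also the only way to cover reverse and many-to-many relations.

# One query, with JOINs onto the forward foreign keys.
Event.objects.select_related('organization', 'calendar', 'organizer')

# One query for the events plus one extra query per listed relation,
# matched up in Python; works for reverse relations like participant_set.
Event.objects.prefetch_related('participant_set')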
If you are familiar with Django, then this is not news to you. Given the setup above you would likely jump on select_related and call it a day, e.g.
class EventViewSet(ReadOnlyModelViewSet):
    queryset = Event.objects.select_related(
        'organizer', 'calendar', 'participant_set').all()
    serializer_class = EventSerializer
In fact, for a short time, this optimization worked for us; but a problem popped up as the database grew. The above ViewSet generates SQL like this:
SELECT COUNT(*) FROM (
SELECT DISTINCT ON (
e.start_at, e.end_at, e.title
)
e.*,
o.*
FROM event e
LEFT OUTER JOIN participant p
ON ( e.id = p.event_id )
LEFT OUTER JOIN organizer o
ON ( e.id = o.event_id )
WHERE (
e.calendar_id = 5051
AND e.start_at <= '2016-12-02 00:00:00+00:00'
AND e.end_at >= '2016-12-01 00:00:00+00:00'
AND (
UPPER(p.email::text) = UPPER('name@example.com')
OR UPPER(o.email::text) = UPPER('name@example.com')
)
)
ORDER BY
e.start_at ASC,
e.end_at ASC,
e.title ASC
) sub;
Which looks good, but here is the problem: 21 seconds querying the events table in our real-life production database! Why oh why is this happening? A quick EXPLAIN ANALYZE shows that Postgres is doing a sequential scan on both participant and organizer. Now, I am not a DBA, so I can’t fully explain why this is happening, but I do have some ideas about how to avoid the scan.
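If you want to reproduce this yourself, one way (a sketch, not exactly how we ran it) is to hand the SQL that Django generates straight to EXPLAIN ANALYZE:

from django.db import connection

# Illustrative queryset; the production filters are more involved.
qs = Event.objects.filter(
    calendar_id=5051,
    participant__email__iexact='name@example.com',
)
sql, params = qs.query.sql_with_params()

with connection.cursor() as cursor:
    cursor.execute('EXPLAIN ANALYZE ' + sql, params)
    for (line,) in cursor.fetchall():
        print(line)  # look for "Seq Scan on participant" / "Seq Scan on organizer"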
Here is the fastest query I could write in SQL:
SELECT COUNT(*) FROM (
SELECT DISTINCT ON (
e.start_at, e.end_at, e.title
)
e.*,
o.*
FROM event e
LEFT OUTER JOIN (
SELECT * FROM participant
WHERE
participant.organization_id = 9776
AND UPPER(participant.email::text) = UPPER('name@example.com')
)
p ON ( e.id = p.event_id )
LEFT OUTER JOIN (
SELECT * FROM organizer
WHERE UPPER(organizer.email::text) = UPPER('name@example.com')
) o ON ( e.id = o.event_id )
WHERE (
e.organization_id = 9776
AND e.start_at <= '2016-12-02 00:00:00+00:00'
AND e.end_at >= '2016-12-01 00:00:00+00:00'
AND (p.id IS NOT NULL OR o.id IS NOT NULL) -- User filter applied in subselect joins
)
ORDER BY
e.start_at ASC,
e.end_at ASC,
e.title ASC
) sub;
This query is fast in production, but … it cannot be produced using the Django ORM. Here, I explicitly made subqueries on indexed columns to speed it up. In a future post, I promise to discuss how and when we use raw SQL to speed up some of our requests. But for the general CRUD API endpoint, I want to use the ORM so that I can leverage the great filter framework provided by Django Rest Framework. After a lot of tinkering, the best ORM-compatible query I could design is this:
SELECT COUNT(*) FROM (
SELECT DISTINCT ON (
e.start_at, e.end_at, e.title
)
e.*,
o.*
FROM event e
LEFT OUTER JOIN participant p
ON ( e.id = p.event_id )
LEFT OUTER JOIN organizer o
ON ( e.id = o.event_id )
WHERE (
e.organization_id = 9776
AND e.start_at <= '2016-12-02 00:00:00+00:00'
AND e.end_at >= '2016-12-01 00:00:00+00:00'
AND (
UPPER(p.email::text) = UPPER('name@example.com')
OR UPPER(o.email::text) = UPPER('name@example.com')
)
AND (p.organization_id = 9776 OR p.organization_id IS NULL)
)
ORDER BY
e.start_at ASC,
e.end_at ASC,
e.title ASC
) sub;
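For the curious, the ORM call behind SQL of this shape looks roughly like the following; the lookup names are assumptions based on the model sketch near the top, and the exact joins Django emits will differ in the details.

from datetime import datetime, timezone

from django.db.models import Q

start = datetime(2016, 12, 1, tzinfo=timezone.utc)
end = datetime(2016, 12, 2, tzinfo=timezone.utc)
email = 'name@example.com'

events = (
    Event.objects
    .filter(
        Q(organization_id=9776)
        & Q(start_at__lte=end)
        & Q(end_at__gte=start)
        # iexact becomes UPPER(...) = UPPER(...) on Postgres.
        & (Q(participant__email__iexact=email)
           | Q(organizer__email__iexact=email))
        # ORM-friendly stand-in for the "organization or no participant" check.
        & (Q(participant__organization_id=9776)
           | Q(participant__isnull=True))
    )
    .order_by('start_at', 'end_at', 'title')
    .distinct('start_at', 'end_at', 'title')  # DISTINCT ON, Postgres only
)
print(events.count())  # wraps everything in SELECT COUNT(*) FROM (...) sub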
For specific API queries, we actually use a similar query in production. But I couldn’t find a good general solution until I realized that trying to do everything at once might actually be too much to ask for, especially since the API enforces relatively small pages. While reading through the documentation to hunt down another bug, I re-read the docs on prefetch_related, and it occurred to me that I could guarantee exactly 4 fast queries to the db on every API call instead of one sporadically slow query. With the following small change to our viewset definition, our slowest API call is an order of magnitude faster than the previous select_related version (although there is room for improvement).
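Roughly, that change swaps the select_related JOINs for prefetch_related; the exact list of relations below is inferred from the serializer above rather than copied from our code.

class EventViewSet(ReadOnlyModelViewSet):
    # One small, indexed query for the events plus one per prefetched
    # relation, instead of a single query with several big JOINs.
    queryset = Event.objects.prefetch_related(
        'organization', 'calendar', 'organizer', 'participant_set')
    serializer_class = EventSerializer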
In fact, the careful observer will notice that the called components in the resulting New Relic trace are a bit different. Well, the prefetch_related was so successful that the same exact query no longer qualifies for being traced in New Relic, and the average dropped by half!
I would like to caution that I believe this works so well in this case because we limit the API page size, so the data and query sizes in the related queries are relatively small per request. If you need to load a lot of data at once, I can foresee this solution being much worse than select_related.
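For context, that page-size limit is just a Django Rest Framework pagination setting; something along these lines (the numbers are illustrative, not our production values):

REST_FRAMEWORK = {
    'DEFAULT_PAGINATION_CLASS':
        'rest_framework.pagination.PageNumberPagination',
    'PAGE_SIZE': 25,  # small pages keep each prefetch query small too
}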
TL;DR: Monitor and analyze your production queries; prefetch_related can be a great solution if you can keep the number of queries and the page size small.