Delete or not to delete
Should you remove data from the database or simply mark it as deleted?
At Teem we have a lot of data that we need to manage, and often “physically”
deleting the data from disk can be problematic. Either a user simply wants
to undelete something or the deletion would cause problems for a log. The
generic solution to this problem is to soft delete/archive the data by adding
a deleted_at timestamp field to the table and then filter all queries to
hide rows that have been marked as deleted.
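To make the basic idea concrete, here is a minimal, framework-free sketch using Python's built-in sqlite3 module. The visitor table and the helper names (soft_delete, undelete, visible_names) are hypothetical and only illustrate the pattern; this is not how django-archive-mixin is implemented.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# Hypothetical table: deleted_at stays NULL while the row is "live".
conn.execute(
    "CREATE TABLE visitor (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO visitor (name) VALUES (?)", [("Ada",), ("Grace",)]
)


def soft_delete(visitor_id):
    """Mark a row as deleted instead of removing it from disk."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "UPDATE visitor SET deleted_at = ? WHERE id = ?", (now, visitor_id)
    )


def undelete(visitor_id):
    """Restore a soft-deleted row by clearing its timestamp."""
    conn.execute(
        "UPDATE visitor SET deleted_at = NULL WHERE id = ?", (visitor_id,)
    )


def visible_names():
    """Every read has to filter out rows marked as deleted."""
    rows = conn.execute(
        "SELECT name FROM visitor WHERE deleted_at IS NULL ORDER BY id"
    )
    return [name for (name,) in rows]


soft_delete(1)
after_delete = visible_names()    # the archived row is hidden
undelete(1)
after_undelete = visible_names()  # and can be brought back
```

The important part is the filter in visible_names: once a table is soft deleted, every query that should hide archived rows has to carry the same condition.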
This simple approach can take you a long way but doesn’t fully cover all the required scenarios. More specifically, what do you do with related objects? About a year ago, I realized that we really needed to address the entire problem. There are several pieces of data that we never want to delete, e.g. devices and visitors among others. At the time I couldn’t find a really good solution so I wrote my own: django-archive-mixin.
As of today, there are at least two other options that cover this situation as well, and you should check them out too:
The biggest distinction between our project and the ones above is that I tried
to stay as close to the delete logic used in the original Django source code as
possible. In particular, this means mimicking Django’s collector code that
implements the ORM cascade delete. It is interesting to note that Django does
not rely on the database’s cascade functionality but instead manages the delete
process itself. The benefit of doing this is that you can then specify behavior
at delete time via the on_delete argument to the model field. For example:
    class Car(models.Model):
        manufacturer = models.ForeignKey(
            'production.Manufacturer',
            blank=True, null=True,
            on_delete=models.SET_NULL,
        )
will cause the manufacturer field to be set to None when you delete the
related manufacturer. The default behavior (CASCADE, in Django versions before
2.0, where on_delete was still optional) would be to delete the car instance
as well.
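Django applies SET_NULL in Python rather than in the database, but the visible result matches the database-level ON DELETE SET NULL. As a rough illustration of that end state only, here is a small sqlite3 sketch; the table and column names are hypothetical, and it assumes the bundled SQLite was built with foreign key support, which is the usual case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE manufacturer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """
    CREATE TABLE car (
        id INTEGER PRIMARY KEY,
        model TEXT,
        manufacturer_id INTEGER
            REFERENCES manufacturer(id) ON DELETE SET NULL
    )
    """
)
conn.execute("INSERT INTO manufacturer (id, name) VALUES (1, 'Acme')")
conn.execute(
    "INSERT INTO car (id, model, manufacturer_id) VALUES (1, 'Roadster', 1)"
)

# Deleting the manufacturer keeps the car but clears its reference.
conn.execute("DELETE FROM manufacturer WHERE id = 1")
car_rows = conn.execute("SELECT model, manufacturer_id FROM car").fetchall()
```

With ON DELETE CASCADE in place of SET NULL, the car row would be removed along with the manufacturer, which is the outcome the Django example avoids.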
Django provides 6 on_delete options: CASCADE, PROTECT, SET_NULL, SET_DEFAULT,
SET(), and DO_NOTHING.
At delete time, the Django collector crawls the relationships and buckets each
object found into different lists depending on the on_delete configuration for
that specific relationship. CASCADE puts the object in a bucket to be deleted,
PROTECT will cause an exception to be thrown, SET_NULL, SET_DEFAULT, and SET()
each cause an update to that instance, and DO_NOTHING is a no-op. Once I
understood this process, I decided to piggyback on it and add an additional
piece of logic to put more objects into the update bucket. Essentially, I allow
Django to do all of the collection for me and then I go through the collected
objects to check whether each one should be archived instead. If it should, I
move it to the update bucket and move on.
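Before the real function, the bucket-moving step can be seen in isolation in a toy, Django-free model. Everything here (ToyCollector, Archivable, Device, LogEntry, move_archivable_to_updates, apply) is a hypothetical stand-in: it only mirrors the idea of a delete bucket and an update bucket, not Django's actual Collector internals.

```python
from dataclasses import dataclass, field


class Archivable:
    """Hypothetical marker class, playing the role of ArchiveMixin."""
    deleted_on = None


class Device(Archivable):
    def __init__(self, name):
        self.name = name


class LogEntry:
    """A model that is not archivable and is really deleted."""


@dataclass
class ToyCollector:
    # model class -> instances scheduled for a real delete
    data: dict = field(default_factory=dict)
    # (field name, value) -> instances scheduled for an update
    field_updates: dict = field(default_factory=dict)

    def add_field_update(self, name, value, objs):
        self.field_updates.setdefault((name, value), []).extend(objs)


def move_archivable_to_updates(collector, timestamp):
    """Move archivable models from the delete bucket to the update bucket."""
    # Iterate over a snapshot so entries can be removed while looping.
    for model, objs in list(collector.data.items()):
        if issubclass(model, Archivable):
            collector.add_field_update("deleted_on", timestamp, objs)
            del collector.data[model]
    return collector


def apply(collector):
    """Run the pending updates and return what would truly be deleted."""
    for (name, value), objs in collector.field_updates.items():
        for obj in objs:
            setattr(obj, name, value)
    return [obj for objs in collector.data.values() for obj in objs]


device, entry = Device("lobby"), LogEntry()
collector = ToyCollector(data={Device: [device], LogEntry: [entry]})
move_archivable_to_updates(collector, "2016-01-01T00:00:00Z")
removed = apply(collector)  # only the non-archivable LogEntry remains
```

The real function does the same thing against Django's own collector: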
    import django
    from django.db import models
    from django.utils import timezone

    # get_field_by_name is a helper defined elsewhere in the project (not shown).


    def cascade_archive(inst_or_qs, using, keep_parents=False):
        """
        Return collector instance that has marked ArchiveMixin instances for
        archive (i.e. update) instead of actual delete.

        Arguments:
            inst_or_qs (models.Model or models.QuerySet): the instance(s) that
                are to be deleted.
            using (db connection/router): the db to delete from.
            keep_parents (bool): defaults to False. If True, parent model rows
                are kept when a multi-table-inheritance child is deleted
                (Django 1.9+).

        Returns:
            models.deletion.Collector: this is a standard Collector instance but
                the ArchiveMixin instances are in the fields for update list.
        """
        # Local import (likely to avoid a circular import with the mixins module).
        from .mixins import ArchiveMixin

        if not isinstance(inst_or_qs, models.QuerySet):
            instances = [inst_or_qs]
        else:
            instances = inst_or_qs

        deleted_ts = timezone.now()

        # The collector will iteratively crawl the relationships and
        # create a list of models and instances that are connected to
        # this instance.
        collector = models.deletion.Collector(using=using)
        # keep_parents was only added to Collector.collect() in Django 1.9.
        if django.VERSION < (1, 9):
            collector.collect(instances)
        else:
            collector.collect(instances, keep_parents=keep_parents)
        collector.sort()

        # Iterate over a copy of the items: entries are removed from
        # collector.data inside the loop.
        for model, model_instances in list(collector.data.items()):
            # remove archive mixin models from the delete list and put
            # them in the update list. If we do this, we can just call
            # the collector.delete method.
            inst_list = list(model_instances)
            if issubclass(model, ArchiveMixin):
                deleted_on_field = get_field_by_name(model, 'deleted_on')
                collector.add_field_update(
                    deleted_on_field, deleted_ts, inst_list)
                del collector.data[model]

        for i, qs in enumerate(collector.fast_deletes):
            # make sure that we do archive on fast deletable models as
            # well.
            model = qs.model
            if issubclass(model, ArchiveMixin):
                deleted_on_field = get_field_by_name(model, 'deleted_on')
                collector.add_field_update(deleted_on_field, deleted_ts, qs)
                collector.fast_deletes[i] = qs.none()

        return collector
What I love about this logic is that it is a fairly small change to how deletion works while also being low-level enough that it covers all of the deletion cases that the Django ORM handles.
We have been using this mixin for about a year now with no hiccups. It works as expected and hasn’t really needed much attention. If you are using Django and have been looking for a safe delete/archive utility, check it out and let me know what you think.