Denormalisation Follies

One of the topics that popped up repeatedly at DjangoCon last weekend was how bad purely normalised table structures are, and how denormalisation is good for many things, including making your database cry less.

To that end, during the gaps in PyConUK this weekend, I decided to see how easy it would be to write a new Django field that will automatically denormalise a field in a related table across to another model.

Let's take an example; imagine we have this kind of model:

class Picture(models.model):
user = models.ForeignKey(User)
user_username = DenormField(User, "username")
image = models.ImageField(upload_to="foo")
title = models.CharField(max_length=100)

The idea is that the user's username (which, let's face it, you're going to need almost every time you need the photo) is stored in the same table, saving you from doing that join to get it from users each time. While this may not seem like much, it can be useful, and besides, it was fun to write.

My DenormField simply listens to a trio of the relevant signals, and then updates this value whenever it detects that the related User object has been updated, so it only costs on write (which, let's face it, is much better than on every read). You can do this kind of copying in normal Django app code as well, but I think this solution is much cleaner. I would do, I guess, since I wrote it.

It's only really a little test exercise in writing Strange New Fields (and I wanted to see how much monkeypatching it would take - none, as it turns out, just some very nasty code tricks). You can get a copy of it if you feel like having a play (warning: only briefly tested; may attach captions to your cat).

Posted 14th September 2008 in Python, and Django, with 19 comments

comments

  1. Eric Florenzano

    Wow, that's fantastic! You should really contact the django-extensions guys, who have been doing some great things with custom fields.

    Also, what do you think about a field which does a small bit of precomputation, as well, for example being passed an arbitrary function that would be called on each save and given access to initial value and new value?

  2. Dmitry

    You should definitively put this on djangosnippets.com :)

  3. Andrew Godwin

    Yeah, the idea for this was partially inspired by the django-extensions' auto_now replacement field; I figured there was more scope for magical fields than just that.

    As for putting it on djangosnippets, I'd prefer something with more explicit version control, since there's likely to be bugs in it, so an applicable project would be better.

  4. Clint Ecker

    Fabulous! :)

  5. David Z

    Actually, this would be awesome to have in trunk -- it's immensely useful.

  6. Giuliani Vito, Ivan

    Awesome work! Anyway I don't get the lines 57 and 58 (but I'm not really into the django ORM). Sounds to me like some kind of nasty hack, why you suppose that a foreign key could end with "_id"?

  7. Corne

    very nice, One could solve this by creating models against a materialized view in postgres, but this is database independent

  8. Aidas Bendoraitis [aka Archatas]

    As a design suggestion, I think, that it would be more flexible, if it took the foreign-key field name as a parameter instead of the model. So you could have two or more users in the model with their denormalized fields, for example:

    class Picture(models.model):
    author = models.ForeignKey(User, related_name="taken_pictures")
    author_username = DenormField("author", "username")
    publisher = models.ForeignKey(User, related_name="published_pictures")
    publisher_username = DenormField("publisher", "username")
    owner = models.ForeignKey(User, related_name="owned_pictures")
    owner_username = DenormField("owner", "username")
    image = models.ImageField(upload_to="foo")
    title = models.CharField(max_length=100)

    The model of the foreign key then can be retreived by
    cls._meta.get_field(field_name).rel.to, for example,
    Picture._meta.get_field("author").rel.to == User

  9. Andrew Godwin

    Aidas, I agree, that is probably a neater solution, I just wrote the one that read better. My solution does actually allow you to pass in a foreign key name, as well, however, so it allows for more than one foreign key to the same model.

    I'm going to clean up the code a little and then see where to take the code. It might be the django-extensions guys, it might be daring to take out a ticket against trunk...

  10. Tom Dyson

    +1 for trunk!

  11. Erik Allik

    What if the related model/field or the denorm. field was derived from the name of the field? so:

    author = fields.ForeignKey(User)
    author__username = fields.DenormField()

    would be the same as:

    author = fields.ForeignKey(User)
    author_username = fields.DenormField(author, 'username')

    Also the Django ORM could automatically use this DB column instead of author.username in the case of MyModel.objects.filter(author__username=abc).

    Just an idea :)

  12. Erik Allik

    Forgot to mention that this would force people to stick to a common naming scheme for denormalized fields so it wouldn't be possible to write:

    author - fields.ForeignKey(User)
    nobody_knows_what_this_field_does = fields.DenormField('author', 'username')

  13. Andrew Godwin

    Erik, that would work, but doesn't really gel for me - Django for me is all about doing things both explicitly and allowing you to change things, so forcing people into one naming scheme, and then having to override filter() so it magically detects that column rather than splitting on __ like it usually does, isn't the kind of thing I'd be thinking of.

    I think the amount of people who'd be utterly confused by the edge case bugs wouldn't be worth the saving of typing a few less characters, as well as me regarding a field that takes NO arguments and does all that as a little _too_ magic.

  14. Erik Allik

    Andrew, I have to agree with you about the magic part. I just thought that it would be neat, besides denormalization could be interpreted as a magic way of caching stuff to avoid expensive queries/computation. But that's just my point of view.

    BTW, why not just provide a ComputedField that takes a function that is passed self and then provide a whole set of functions that do stuff: one for auto_add_now and auto_now, one for caching a field of a related model, some for doing aggregation etc. I think that'd be Pythonic and extremely flexible. Specifying the fkey field and related model field would then just become arguments to a function factory:

    author_username = fields.ComputedField(func=cache_related_model_field(author, 'username'))

    fields.ComputedField(func=auto_add_now)
    fields.ComputedField(func=auto_now)
    number_of_posts = fields.ComputedField(func=num_related(post)

    etc.

    The function could also register signals for more advanced cases.

  15. Andrew Godwin

    We're discussing this on django-developers now: http://groups.google.com/group/django-developers/browse_thread/thread/9a672d5bbbe67562

    Basically, I'm edging towards something very similar, Erik, which is having an arbitary QuerySet as the argument. Combined with the work on the new aggregation functions, this will solve a lot of use cases.

    There's something to be said for having arbitary functions, too, however; the only caveat is that one must say which model they apply on, and for some which key, so I need to go think this out.

  16. Erik Allik

    If the function can access self, what does it need anything else for? Why pass a queryset when the function can simply construct it on will? Maybe I'm missing something.

  17. Andrew Godwin

    Well, the field needs to be able to know what it's supposed to be denormalising; hence you need to pass a queryset so it can find the model (which is usually a different model from the one the field is in) and any conditions (e.g. only comments from this post, for count queries).

  18. Biappi

    While the DenormField is really cool, and i think it should be used (wisely), in this particular example, i think it's not needed at all.

    Fact is, in many web services, the username *is* a key, in fact i do not know why many designers insist in using an integer primary key /and/ a unique username identifier.

    It's not written at all that every table in a DB must have an autoincrementant integer primary key, if the application domain permits it, it's much simpler to use keys that are in the application world.

    Of course the DenormField is the only way if, for example, the key is the email address and we want to show a "screen name" that's not unique or that changes in time...

  19. huxley

    @Biappi,

    In principle it's a good idea. After all a username is unique, but in practice the point of a primary key is that it should be a permanent unchanging identifier, but that rules out most data in userspace.


  20. (optional)