Article / 3rd Nov 2022

Static-Dynamic Content With In-Memory SQLite

This website has been running in some form since 2006, and back then it was one of my very first Django sites, stored in a Subversion respository and using the thrillingly new Python 2.5.

I can't actually remember if it was the very first thing I built in Django, but I think it might have been.

It was a very basic CMS built out of the standard building blocks Django is still known for - the admin, forms, and easy templating. It stayed like that for many years, with me authoring blog posts via a somewhat-custom markup language in a big text box in the Django Admin.

While this worked well for short text posts, it had many downsides - easy to lose work, the markup language was strange, and uploading images was especially hard as I had to put them onto a blob storage service myself.

Faced with this, in 2017 I migrated this site to Wagtail, an excellent CMS whose spiritual predecessors I worked on back when I was at Torchbox. This solved my image upload problems immediately, and with its flexible StreamField - a data type composed of a set of customisable "blocks" - I had a more reliable way of making rich content than a custom markup language.

Quickly, though, I started to grow tired of having to edit every single thing in an admin interface, and all the cool kids were running around with their static site generators editing content in nice text editors. I still wanted some dynamic features, though - this site runs a few things behind the scenes, including an SMS gateway, and I didn't want to throw those away.

Part Static, Part Dynamic

So, I decided to mix the two ideas - static content, authored as YAML files and committed into the Git repository that now backs the site, but loaded and run as part of a dynamic Django site.

How does one do this? Well, Django already has a fixtures system to load content into the database from flat files, but it's meant much more for tests, and isn't terribly flexible. In fact, in general, I realised that for loading from flat files, you really don't want the database to persist, as if you have a fresh one each time you can ignore all the update and matching problems.

I think it's fair to say that Django's fixtures system is not entirely designed to be human-writable. This is maybe something we could fix.

So, I decided to use Django's multi-db support and back certain models - the ones I wanted loaded from flat files - from an in-memory SQLite database generated from my content every single server boot, rather than the normal PostgreSQL that the rest of the site runs off of.

I started with the page I was most annoyed doing through an admin interface - my speaking page, that lists all my talks. The YAML flat file format looks like this:

- conference: Electromagnetic Field
  title: "Taming Terrain: Sculping geoscapes from LiDAR"
  when: 2018-08-31
  city: Ledbury
  country: GB
  length: 30
  slides_url: https://speakerdeck.com/andrewgodwin/taming-terrain-sculpting-geoscapes-with-lidar
  video_url: https://www.youtube.com/watch?v=Qg0hvqDa3wk

- conference: PyCon ES
  title: Taking Django Async
  when: 2018-10-06
  city: "Málaga"
  country: ES
  length: 30

These YAML entries just map to a Django model - the code I wrote makes an in-memory SQLite database, uses Django's SchemaEditor internals to make tables in it from the model definitions (no migrations here!), and then just bulk inserts them into a model like this one:

class Talk(models.Model):

    __yamdl__ = True  # See what this means at the end of the blog post

    title = models.CharField(max_length=255)
    description = models.TextField(blank=True)

    length = models.IntegerField(blank=True, null=True)
    keynote = models.BooleanField(default=False)

    when = models.DateField()
    conference = models.CharField(max_length=255, blank=True, null=True)
    city = models.CharField(max_length=255, blank=True, null=True)
    country = CountryField(blank=True, null=True)

    slides_url = models.URLField(blank=True, null=True)
    video_url = models.URLField(blank=True, null=True)

This is a lot easier to edit quickly than the old format, and I can pop back in and quickly throw in slide and video links in a few minutes rather than lots of clicking.

It also will recurse through a directory, so I end up with a directory tree like this, each file containing the set of entries for each year:

content/
    speaking.Talk/
        2019.yaml
        2020.yaml
        2021.yaml

This was how it stayed for several years - ancillary data stored in these flat files, but the main blog posts were still in Wagtail. Recently, though, I realised that the reason I basically don't blog enough - even with the realisation that I can just do short posts - was because the friction to doing so was just too high.

So, given this, I decided to apply the solution I already knew worked to the rest of the site - except this time, YAML wasn't quite enough.

Adding Markdown

Given I was already had a method to load content from YAML files, the main obstacle I faced was how to encode the blog post bodies - the StreamField abstraction from Wagtail, while nice, would be hellish to write in YAML.

A markup language was the obvious choice, and as much as I am a fan of ReST, I decided to finally give in and use Markdown due to better editor support for it. I needed a way to mix the two file formats, but I decided to copy the "front matter" format I've seen elsewhere and just separate a YAML document and markdown document with the --- separator:

date: 2022-11-01 12:00:00+00:00
image: blog/2022/cyber-mastodon.jpg
slug: fediverse-custom-domains
title: The Fediverse, And Custom Domains

---

For reasons that are very relevant to the events of this past week, I have
decided to finally get around to setting up a Mastodon / Fediverse account.

I have the parser map any markdown content to a "content" column with the rest of the YAML data, and understand that objects can have a file to themselves as well as all being in a file together.

The last job I had was to map all my existing blog posts out of Wagtail's tables into this new format - I will give a lot of credit to Wagtail's design here, as the StreamField, at its core, is just a list of dicts with type keys, and any rich text is just in HTML.

By mapping all my custom type blocks to the appropriate new Markdown (with a couple of small extensions for things like easy YouTube embeds), and using markdownify to convert the HTML content into Markdown, I had a mapper for all my old content (which goes back to 2006, with the same URLs it had back then!)

This means my blog posts now look like this:

content/
    blog.Post/
        03-19-van-electrics.md
        11-01-fediverse-domains.md

Seeing Things

The final piece of this puzzle was images - with blog content checked into the repository, I wanted to do the same with them. That's easy enough at the top level - I placed them in a fixed media directory, and make sure my YAML loader could correctly get content into Django's ImageField.

Thumbnails were the main stumbling block - a traditional static generator would make them at compile time, but I don't have that option. Instead, I wrote a small thumbnailer that hashes the file name and requested dimensions to a predictable filename, checks if it exists, and then writes it if not.

If you too want to write a small thumbnailer, Pillow's ImageFit probably has almost all the basic sizing functionality you might want.

This means I can browse through the site locally, generate the thumbnails for the most-hit pages, and check them in along with the original images so they're already there when someone asks for the page. If not? Well, it turns out that, compared to 2006, servers are Really Fast Now - and so it just thumbnails the image on the fly and serves it. It's cached on the local disk for as long as the Cloud Run server is up, but it's lost quite soon - but no real issue, as those will only be images on less-visited pages.

It's All In Memory Now

What all this means now is that nearly all the site's tables - and all the ones that are used to render these publicly-visible pages - are served from that in-memory SQLite database. If you're in North America (and so near enough to the datacentre), refresh the page a few times or click around, and you will almost certainly - unless you hit a new Cloud Run server and its cold boot - see incredibly fast response times. Turns out that serving everything from an in-memory SQLite is almost as quick as static files!

While this would work on an old-style installation of Django where it's on a server, it works particularly well on Cloud Run as well - with GitHub Actions wired up appropriately, I can jot out a blog post, commit it, and wander away while the new Docker image is built and deployed in a couple of minutes.

I also get to keep using Django to do all sorts of fun dynamic trickery, which I do, of course, enjoy. This technique is maybe not for everyone - there's some really lovely static site generators these days - but for my specific needs, it's wonderful.

If you want to use it, all this code is released as an open-source library called yamdl. I don't remember why I named it that when I started work on this back in 2017, but it's now at a 1.0 release, and even includes autoreloader support for the static YAML files - so you can edit them locally, and Django will reload and re-import your changes.

It just needs installing as an app, a database router loaded, and a couple of other settings (see the README!) and then you can just label a model with the __yamdl__ attribute, and boom, it will go looking for content in content/appname.ModelName/!