January 19, 2022
Buttondown had its first planned partial outage of the year. (Of 2021, too, if my memory isn’t failing me — but, you know, new year and all that.)
This was to upgrade Buttondown’s database: a measly but stalwart
db.t2.medium running on RDS, first spun up in 2017 and humming along ever since. 1 The general parlance when thinking ab out relational databases is “people tend to underestimate how strong and scaleable a small, boring Postgres box can be” and I think that is true: this cute lil $29/month instance has brought me very few complaints over the past few years.
The upgrade was necessitated by two factors:
First, the database (cleverly named
buttondown) was running Postgres 9.6, which Amazon was planning on formally deprecating and forcing upgrades. They’d pushed back the required upgrade window twice already, but I did not love the idea of waking up at 4am to a slew of pages because AWS decided when to take the database down.
Second, about that “strong and scaleable” thing… there is one relatively expensive and onerous part of Buttondown’s database, which is the event log — 170M records that represent every time an email has been sent or opened or clicked on or anything in that vein. This is what powers a number of experiences — the analytics page, stale subscriber detection, etc. — all of which have gotten much slower over the past few months, to the point where parts of the Analytics page straight up do not load for some of Buttondown’s bigger users.
To give you a sense of what the
email event schema looks like, here’s a slightly truncated copy-paste from the codebase:
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
token = models.CharField(max_length=100, unique=True)
email = models.ForeignKey("emails.Email")
subscriber = models.ForeignKey("emails.Subscriber")
event_type = models.CharField()
timestamp = models.DateTimeField()
creation_date = models.DateTimeField(auto_now_add=True)
metadata = JSONField(default=dict)
delivery_provider = EnumField(
I wrestled with how to tackle both of these for a while. I am a DBA by necessity, not by trade: the idea of spinning up a multi-AZ database and graceful write flip and all of these things felt like the right thing to do, but onerous for what would ideally be a one-off process.
I settled on a rough plan of attack:
- Spin up a fancier and stronger database — a
db.t4.xlarge— as a read-only replica of the existing database. 2
- Switch Buttondown into read-only mode, thus only hitting the read-only replica.
- Take the original database down, upgrading it to Postgres 13. 3
- Hope nothing explodes. 4
- Shift everything back over to the primary database.
- Upgrade the read-only replica to Postgres 13, too.
- Promote the read-only replica to a primary database and shift all the traffic over.
This is probably not something I can do eighteen months from now. 20 minutes of read-only traffic becomes less and less feasible as Buttondown’s average user activity grows, but it ended up working out extremely well.
Now, the sad part: my wishful thinking of “maybe I can just throw more hardware at the problem” has not borne fruit. The queries themselves are faster, but not an order of magnitude so: things are taking thirty seconds instead of sixty seconds when I need them to be taking three seconds.
This means I have to go back to what everyone was telling me I should have done from the start, and think with more nuance about the actual queries I’m making. To pick on the three most painful queries at the moment, and my rough plans of attack:
- Exporting all email events for a newsletter. This is particularly painful because Django’s ORM is doing a join on
EmailEvent.subscriber— a “get me all subscribers for newsletter $FOO, then all events for those subscribers” sort of thing. I’m adding a denormed foreign key —
EmailEvent.newsletter, and indexing pretty heavily on that.
- Inbox/client statistics. Every email event comes with a payload of reader metadata: what browser/OS/IP/etc. they’re using, which I feed into a cute little chart to show reader information. Rather than pull this through the huge, denormalized
EmailEventmodel, I plan on creating a new
Clientmodel that keys on
Subscriberand tracks this. I think this is not only going to be more performant but will result in some nicer subscriber-level insights: being able to say “firstname.lastname@example.org reads on an iPhone and in the Mimestream app” is a feature that is impossible at the moment.
- Aggregate events over a time series. This is the one that I’m perhaps the most nervous or unsure about. Newsletters are able to view high-level statistics about their open rate or click rate over a period of time (that time is somewhat hard-coded to a week/month/year at the moment, but I’d like to have arbitrary flexibility). I suspect that a simple compound index on
newsletter, event_typewill do the trick here but there might also need to be a
datethrown in there for good measure.
And you know what? All of these things seem tackle-able in isolation. I think I was guilty of thinking about this problematic model as a bit of a black box — everything that touched it was running into issues, therefore there needs to be one panacea of a change. Even if it’s going to take a bit more time than I hoped for or expected, going problematic query by problematic query is the right path forward.
(And now I get to figure out how to backfill 170M rows without tearing my hair out!)
A fairly common question has been “why are you using RDS”? This is a fair line of inquiry. RDS represents a bit of a Goldilocks solution to me; it is straight-up cheaper than the same hardware on Heroku’s managed database product, and ostensibly less difficult to maintain than using EC2 directly. ↩
My hope here is that I can naively throw money at the problem to make the database faster. I have had seventeen or so very smart people tell me “no, just add another index, you idiot”, and they are all correct, but also I am in the position of finding it extremely high-ROI to spend an extra $50/mo and not have to worry about indices for another few months. ↩
And, to AWS’ credit, this was as simple of a process as you can hope for. I give the AWS UI a lot of guff — all of it deserved — but it handles the job to be done of “modify an instance and bump up the database version” extremely well. ↩
Nothing did! ↩