May 19, 2019
Outage.
I had a rough whoopsie last week:
Ugh, pushed a change that broke login. Fix is going live right now.
— Buttondown (@buttondown) May 16, 2019
This is, roughly speaking, my nightmare! Even as I grow more and more rigorous about automated testing (Buttondown’s regression suite is pretty extensive, and reflects the things that I’m most concerned about: email rendering and delivery, because email is a thing that can’t be reverted or undone once it’s sent), there are holes and embarrassing failures.
The genre of this failure is particularly revealing in terms of where some of my sore spots are. Here’s what happened, roughly:
- I have lots of integration tests around login to make sure that, no matter what, an internal request to log in won’t fail.
- I also have some asynchronous things that happen after a user logs in: I send out an analytics event, I log it in Slack, and I generate some session data.
- On Thursday, I had cleaned up my Slack logging, breaking stuff out into a new subset of channels (see below).
- That ‘cleanup’ involved a bug where the Slack event I send for some users just failed completely.
- I disable Slack logging when testing, so as not to spam my feed.
- Oh.
- Oh no.
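The failure mode in the list above can be sketched in a few lines. This is a minimal illustration, not Buttondown's actual code: `run_side_effects` and `broken_slack_notifier` are hypothetical names. The idea is that each post-login side effect runs in isolation, so a broken notifier gets logged instead of taking the login down with it.

```python
import logging

logger = logging.getLogger("login")


def run_side_effects(user_id, side_effects):
    """Run each post-login side effect, isolating failures from the login itself.

    `side_effects` is a list of (name, callable) pairs.
    Returns the names of the side effects that raised.
    """
    failed = []
    for name, effect in side_effects:
        try:
            effect(user_id)
        except Exception:
            # A broken notifier is logged, not propagated to the login request.
            logger.exception("post-login side effect %r failed", name)
            failed.append(name)
    return failed


def broken_slack_notifier(user_id):
    # Hypothetical stand-in for a Slack call with a bug in it.
    raise RuntimeError("channel not found")


sessions = []
failed = run_side_effects(
    42,
    [
        ("slack", broken_slack_notifier),
        ("session", sessions.append),  # stands in for generating session data
    ],
)
```

Here the Slack failure is recorded in `failed` while the session step still runs. A related option in tests is to swap the Slack transport for a fake that records calls, rather than disabling the code path entirely, so the formatting and routing logic still gets exercised.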
I’m lucky in that most outages are eminently fixable: this one took me ~two minutes to roll back the latest commit and another ~fifteen to diagnose what the issue was. But there are still a lot of things to learn here:
- I’ve gotten a lot of confidence in my continuous integration setup, but I should still stop pushing new code at night right before going to bed!
- Production-level testing (which I use Pingdom for) is something I should invest more in.
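As a rough idea of what a production-level check boils down to, here is a minimal sketch of a smoke test against a live endpoint. It is not how Pingdom works internally, and `check_endpoint` and its `fetch_status` parameter are hypothetical names; the fetcher is injectable so the check itself can be tested without touching the network.

```python
from urllib.request import urlopen


def check_endpoint(url, fetch_status=None, timeout=10):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    if fetch_status is None:
        # Default fetcher: a real HTTP request using the standard library.
        def fetch_status(target):
            with urlopen(target, timeout=timeout) as response:
                return response.status
    try:
        return fetch_status(url) == 200
    except Exception:
        # Timeouts, DNS failures, and HTTP errors all count as "down".
        return False


def always_down(url):
    # Hypothetical fetcher simulating an unreachable server.
    raise OSError("connection refused")
```

Run on a schedule against the real login page, a check like this would catch the class of bug where the test suite passes but production is broken.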
(As an aside, I can’t help but suspect I am still in the ‘newbie mode’ level of customer support, where everyone is unduly nice to me about these things. I can’t pretend that this isn’t at least somewhat by design — turns out when you’re nice and friendly to your users they respond in kind! — but I am dreading the day where I have legitimately unfriendly interactions.)
Malice!
Here is the least fun aspect of running Buttondown: malicious users. I have, in the past two years, only run into two bad actors, and last week marked my third.
I cannot emphasize enough how much of a bummer it is to diagnose a user as a villain. All three have been of the same genre (generic name, instant signup, register a card, and try spamming ten to twenty thousand emails), which makes them easy to spot. Still, spending my time building out automation for the worst possible use case is sort of a morale-killer. I want to be cleaning up some interfaces or making rendering faster, not adding administrative layers around large user imports or adding a site-wide denylist.
(The one thing that is ‘fun’ is that this is a thing you only really run into on sufficiently mature projects, and divining a strategy for anti-spam from first principles will be a worthwhile exercise!)
Reducing the bus factor
Are you familiar with the bus factor? It is a useful (if morbid) concept:
The “bus factor” is the minimum number of team members that have to suddenly disappear from a project before the project stalls due to lack of knowledgeable or competent personnel.
For the past few months, I have been trying to work with an eye toward a future where a lot of customer-service-esque stuff is shouldered by an assistant. Largely, this means taking common procedures and tasks that exist in two places:
- my brain
- a command line tool
And depositing them in places that are much more non-Justin friendly, like:
- a wiki
- an admin/slack interface
What’s interesting is that this is a useful refactoring exercise in its own right. A piece of business logic that has to be invoked from Slack or from the command line or from the internal admin needs to be decentralized and relatively stateless; a piece of weird subscriber logic that can’t be easily explained or documented without a flow chart should probably be rewritten.
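To make that concrete, here is a minimal sketch of the shape this refactoring tends to take: one stateless core function, with thin adapters for each entry point. Everything here is hypothetical (`unsubscribe`, `from_cli`, `from_slack`, and the `Result` type are invented for illustration), and the subscriber list is modeled as a plain `frozenset` rather than a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    ok: bool
    message: str


def unsubscribe(subscribers, email):
    """Stateless core: takes the current subscribers, returns (new subscribers, Result)."""
    normalized = email.strip().lower()
    if normalized not in subscribers:
        return subscribers, Result(False, f"{normalized} is not subscribed")
    return subscribers - {normalized}, Result(True, f"unsubscribed {normalized}")


def from_cli(subscribers, argv):
    # Hypothetical CLI adapter: argv looks like ["unsubscribe", "a@example.com"].
    return unsubscribe(subscribers, argv[1])


def from_slack(subscribers, text):
    # Hypothetical Slack adapter: text looks like "/unsubscribe a@example.com".
    return unsubscribe(subscribers, text.split(maxsplit=1)[1])


subs = frozenset({"a@example.com", "b@example.com"})
after_cli, cli_result = from_cli(subs, ["unsubscribe", " A@Example.com "])
after_slack, slack_result = from_slack(subs, "/unsubscribe a@example.com")
```

Because the adapters only parse input and delegate, both entry points produce the same outcome, and the core function is the only thing that needs documenting in the wiki.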
The to-do list.
Last week, I had three main goals:
- Migrating the User → Newsletter relationship from a one-to-one to a one-to-many. Done!
- Complete a short sprint on email addresses and email domain validations. Done!
- Ship a marketing page. Done — well, it came out as a blog post, but still counts.
This is the first time in a long time that I can remember biting off exactly as much as I meant to chew, which feels amazing. I’m finding that a structure where I work on a couple small things throughout the week (low-focus work, like documenting errors, making small tweaks, or fixing bugs) and then I have a big sprint of work on the weekend works really well for balancing “brick” work and “mortar” work.
This week, I’m going to go back to the well and pick out three things:
- Ship an optional CAPTCHA component. Folks are running into more spammers on embedded forms, and this will be a good layer to catch them. It’s not exciting work, but it’s been on my list forever, and embedded-form spam is probably the most dominant genre of unhappy user feedback.
- Ship three blog posts. These should all be relatively small (two are feature announcements, and one will be about Buttondown’s flavor of Markdown).
- Ship some anti-malice detection. It’s unclear what this will look like (I think a combination of an IP denylist and a keyword denylist should be sufficient for the short term), but there will be a lot of research and groundwork being laid here.
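For a sense of what the IP-plus-keyword combination might look like, here is a minimal sketch using only the standard library. The function names and the denylist entries are made up for illustration (the network is a reserved documentation range), and a real version would presumably load its lists from storage rather than hard-code them.

```python
import ipaddress

# Hypothetical entries for illustration only.
DENIED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
DENIED_KEYWORDS = {"free money", "act now"}


def is_denied_ip(address, networks=DENIED_NETWORKS):
    """True if the address falls inside any denied network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


def denied_keywords(text, keywords=DENIED_KEYWORDS):
    """Return the denied keywords found in the text, case-insensitively, sorted."""
    lowered = text.lower()
    return sorted(k for k in keywords if k in lowered)


def should_hold_for_review(signup_ip, email_body):
    """Flag a send for manual review if either denylist matches."""
    return is_denied_ip(signup_ip) or bool(denied_keywords(email_body))
```

Holding a flagged send for review, rather than rejecting it outright, keeps false positives cheap: a legitimate user hitting a keyword match gets a delay instead of a ban.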