Expanding to multiple ESPs

                            March 1, 2023

                        Well, gang, thanks for letting me soft-launch surveys last week. The results are in, and I promise to play by the rules: this week, a technical deep dive and next week a war story.

So, a bit of a technical dive, courtesy of Matt:
have you written about using multiple ESPs before? are you doing some kind of round-robin? splitting per account? one primary and the others as fallback for downtime?
— matt swanson 😈 (@_swanson) February 20, 2023

When Buttondown launched, it was heavily oriented around a single email service provider: Mailgun. It used (and still uses) django-anymail on the backend.
(You might be tempted to ask “why Mailgun”, to which I have a boring and simple answer — it had a free tier and my day job was using it.)
Because I was using django-anymail, there wasn’t a lot of technical build-out work required. The most consequential thing I did in those early days was to poorly name my email events table (where click events, open events, and so on are stored) mailgun_events. (A name that has stuck around, because the costs and risks of renaming a table [and a Django app] so aggressively outweigh the rewards.)
And life on Mailgun was good, for the most part. I sent over a lot of volume, and their pricing was for a while the highest single cost I incurred, but there weren’t that many issues. Their support staff was responsive; the performance wasn’t amazing, but it was also not so bad that I had to actively think about moving. It was, in many ways, the classic SaaS land-and-expand relationship: they got me with their free tier and then I stuck around because the cost of switching was never quite worth it.

Some architectural changes are insisted upon by a single mediating force: maybe you have a bad incident and the obvious remediation is to rebuild a key component, maybe you have a P0 feature or customer ask for which a new architecture is required. 
Most, though, are the opposite: a number of slow factors gradually pushing roi_on_rearchitecture from negative to neutral to positive until you’ve got a sufficiently high level of confidence that the new architecture should be built. This was the case with starting to build out support for multiple ESPs:

Mailgun’s pricing, while not exorbitant, was non-trivially more expensive than Amazon’s. The difference between $30/mo and $10/mo was not that big of a deal; the difference between $600/mo and $200/mo certainly was, especially in the early days of Buttondown where ARPU was relatively low and the number of free users was relatively high.
Postmark, a company that I admired and enjoyed, opened up their ESP rails to handle non-transactional cases.
The number of edge cases that I ran into with Mailgun — custom domains not working with a global proxy, link tracking not working with SSL — increased concomitantly with my usage of it.
Mailgun, like any other SaaS, had a few outages, and I felt retroactively ashamed ¹ that I didn’t have an ESP to which I was ready to fail over.

All of these things and more congealed into an obvious next outcome I wanted to have: equal sending parity between AWS, Postmark, and Mailgun, with the following set of goals:

The ability to turn off an ESP at a moment’s notice in outage scenarios
The ability to slowly migrate from one ESP to another for pricing or performance reasons without degrading service
The ability to arbitrarily gate certain newsletters into certain ESPs for various reasons

The actual implementation of this was about 90% boring thanks to django-anymail sanding off a lot of the edges in API differences between the three: you plug in all the API keys, store a field like Newsletter.delivery_provider, and you’re good to go.
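As a rough illustration of how a per-newsletter provider field can serve the three goals above, here is a minimal sketch in plain Python. Only the delivery_provider field comes from the post; the choose_provider function, the weights mapping, the disabled set, and the provider labels ("mailgun", "postmark", "ses") are hypothetical names picked for the example, not Buttondown's actual code.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Newsletter:
    name: str
    # Explicit gate: when set, this newsletter is pinned to one ESP.
    delivery_provider: Optional[str] = None


def choose_provider(newsletter, weights, disabled=frozenset(), rng=random):
    """Pick an ESP for one send (hypothetical helper).

    weights:  provider -> relative share of traffic for unpinned newsletters,
              which allows a gradual migration by nudging the numbers.
    disabled: providers switched off, e.g. during an outage.
    """
    pinned = newsletter.delivery_provider
    if pinned is not None and pinned not in disabled:
        return pinned

    # Fall back to weighted selection among providers that are still enabled.
    candidates = {p: w for p, w in weights.items() if p not in disabled and w > 0}
    if not candidates:
        raise RuntimeError("no ESP available for sending")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]
```

Under these assumptions, turning off an ESP means adding it to disabled, and migrating slowly means shifting the weights over time; pinned newsletters fall back to the weighted pool only while their own provider is disabled.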
That remaining 10%, though…

All three ESPs have divergent concepts of “events”: what constitutes a temporary failure vs. a permanent failure, what constitutes a bounce vs. a drop vs. a rejection, that sort of thing. So I had to build out a generic ingestion pipeline that could take an arbitrarily-shaped ‘event’ and convert it into a standard, user-facing one.
django-anymail does not provide a unified interface for the long tail of escape-hatch options for each ESP (“disable click tracking for just this email”, “add these tags”, “add a list-unsubscribe header” — that kind of thing). This means I have a terrible but functional prepare_email method that takes a RenderedEmail and a delivery_provider and sets all the relevant bits.
Custom domains. Around 40% of Buttondown’s traffic runs through custom domains; the remaining 60% runs through Buttondown’s domains. 
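The event-ingestion idea in the first item can be sketched as a lookup from (provider, raw label) to a single standard vocabulary. This is a sketch under stated assumptions: the raw labels below are illustrative placeholders rather than the real webhook vocabulary of any ESP, and EventType, NormalizedEvent, and normalize are hypothetical names. It also assumes an upstream adapter has already reduced each webhook payload to a dict with "event" and "recipient" keys.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    DELIVERED = "delivered"
    TEMPORARY_FAILURE = "temporary_failure"
    PERMANENT_FAILURE = "permanent_failure"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class NormalizedEvent:
    provider: str
    type: EventType
    recipient: str


# Illustrative labels only: each provider names the same outcome differently.
RAW_TO_STANDARD = {
    ("mailgun", "delivered"): EventType.DELIVERED,
    ("mailgun", "temporary_fail"): EventType.TEMPORARY_FAILURE,
    ("postmark", "delivery"): EventType.DELIVERED,
    ("postmark", "hard_bounce"): EventType.PERMANENT_FAILURE,
    ("ses", "delivery"): EventType.DELIVERED,
    ("ses", "permanent_bounce"): EventType.PERMANENT_FAILURE,
}


def normalize(provider, raw):
    """Convert a provider-specific event dict into the standard shape."""
    label = raw.get("event", "")
    # Unrecognized labels are kept as UNKNOWN rather than dropped or guessed at.
    event_type = RAW_TO_STANDARD.get((provider, label), EventType.UNKNOWN)
    return NormalizedEvent(
        provider=provider,
        type=event_type,
        recipient=raw.get("recipient", ""),
    )
```

Keeping the mapping in one table means that adding a provider, or correcting how a failure is classified, is a data change rather than a new code path.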

All of this was built out with an eye towards avoiding second-system syndrome; new functionality was slowly grafted onto the old system until it was production ready.
The final product is acceptable. I can switch off all traffic from a single ESP with the click of a button (albeit with some disruption to custom domain senders); that’s the really important part. There are lots of rough edges, but most of those rough edges come from bad object-level code (my way of verifying custom domains, for instance, is just janky for no interesting reasons besides “the code is brittle and bad”) rather than from failures of strategy.
If I were to do it all over again, I’d do the same thing that I’ve done with rewrites of previous systems: promote stateful items (domains, providers) to top-level models and separate the concerns of tracking & updating state from the concerns of operating & reacting to that state. 
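One way to picture that separation, as a minimal sketch and not the author's design: providers and domains become their own records, a small set of functions is responsible for updating their state, and the sending path only reads that state. All names here (Status, Provider, Domain, set_status, mark_verified, can_send_from) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    DISABLED = "disabled"


@dataclass
class Provider:
    name: str
    status: Status = Status.ACTIVE


@dataclass
class Domain:
    hostname: str
    provider: Provider
    verified: bool = False


# Tracking and updating state: the only code that mutates these records.
def set_status(provider, status):
    provider.status = status


def mark_verified(domain, verified):
    domain.verified = verified


# Operating on and reacting to state: read-only with respect to the records.
def can_send_from(domain):
    return domain.verified and domain.provider.status is Status.ACTIVE
```

With this split, an outage toggle or a DNS re-check changes a record in one place, and every sender picks up the new state the next time it calls can_send_from.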

Thanks, Matt, for asking! Next week: “how to roll a new S3 bucket on Christmas Eve”

¹ Shame is a surprisingly good motivator for deep architectural work.

² Here is a classic blunder: blindly vending DNS requirements from a third-party SaaS to your users. Proxy them through your own DNS instead so you can change them as necessary!
