March 8, 2021, 6:26 a.m.
Last week, I mentioned that I was gearing up for a week of frustrating (and hopefully rewarding) Heroku spelunking: dusting off the proverbial (and I guess literal — I had to update a couple dependencies) toolbox and going to town.
…
It was a one-liner, I think.
Not even a satisfying one-liner, the kind of one-line change that you arrive at with a sense of exhausted satisfaction after many hours down in the Performance Mines. No, this was a stupid, idiotic one-liner, an “oh my god, of course it’s this issue, why didn’t I just look at this sooner” one-liner. The “this” in this case was a timeout set by gunicorn, the server I run in front of Django — it was set to kill any request after two minutes, whereas Heroku enforces a hard timeout of thirty seconds.
So, in the rare cases where the server was overloaded or a particularly problematic endpoint was getting hit, Heroku would sever the connection after thirty seconds, but my app would chug away for an additional ninety seconds, tying up a gunicorn worker on a request whose connection was already dead and then just dying anyway. This led to cascading failures, and to me getting paged to spin up more hosts.
Since noticing this issue and pushing the one-liner (technically a two-character-er, if we’re counting), I’ve had no such incidents. This could just be good luck — but I’m pretty sure it was the root cause. What’s more, I actually get information in Sentry about what’s timing out now as opposed to those timeouts falling into a black hole.
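For the curious, the shape of the fix looks something like the below. This is a sketch rather than the literal diff (the setting could just as easily live in a --timeout flag on the gunicorn invocation in the Procfile); the point is simply that gunicorn's worker timeout needs to sit at or below Heroku's thirty-second router timeout.

# gunicorn.conf.py (a sketch, not the actual change): gunicorn's `timeout`
# setting is in seconds. Keeping it at or below Heroku's 30-second router
# timeout means a worker gives up on a request no later than the router does,
# rather than grinding away on a connection that has already been severed.
timeout = 30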
I know, I know. Better late than never. But I probably spent eight man-hours and a few thousand dollars in aggregate manually scaling the cluster up and down because of these timeouts, and I am more than a little beside myself that I didn’t notice or pay attention to this any earlier.
On the other side of the scaling equation, I decided to have a little bit of fun and roll my own autoscaling for workers. Worker load is legitimately spiky; the steady-state load is fairly low, punctuated by bursts of thousands of emails being sent out. Here is the full source of that file, for what it’s worth:
import logging
from typing import List

import django_rq
import requests
from django.conf import settings
from django_cron import CronJobBase, Schedule

# Notably, `high` includes all outbound emails. We exclude `low`,
# which is used for subscriber imports, Slack notifications, and
# some email validations. (Yes, we should be more granular here.)
QUEUES_TO_CONSIDER_FOR_SCALING = ["high", "default"]

logger = logging.getLogger(__name__)

# Just so we don't starve ourselves too heavily.
MINIMUM_WORKER_COUNT = 3

# This is an arbitrary cutoff, but I'm worried about overloading Redis.
MAXIMUM_WORKER_COUNT = 50

# Another arbitrary number that I'll probably want to tinker with at some point.
# This number _should_ correspond with something concrete — how many events can
# we burn through in a given ten-minute period, for instance — but that varies
# wildly from enqueued job to job. A longer-term solution probably involves getting
# very granular with the queues that we use, so we can do things like say "okay,
# sending an email takes X sec, so we can process Y per worker-minute, and scale from there".
# But...that's a future problem, and we just want some basic scaling for now.
# This current value means that if we have a backlog of 10k, we scale to maximum, which seems right.
EVENTS_PER_WORKER = 100

# No clever way to handle this but I don't see it changing any time soon!
DYNO_SIZE = "standard-1x"

# This should correspond with `Procfile`.
DYNO_TYPE = "worker"

HEROKU_API_ENDPOINT = f"formation/{DYNO_TYPE}"


def calculate_backlog_size(queue_names: List[str]) -> int:
    backlog_size = 0
    for queue_name in queue_names:
        queue = django_rq.get_queue(queue_name)
        backlog_size += queue.count
    return backlog_size


def calculate_worker_count(backlog_size: int) -> int:
    estimated_worker_count = backlog_size / EVENTS_PER_WORKER
    if estimated_worker_count < MINIMUM_WORKER_COUNT:
        return MINIMUM_WORKER_COUNT
    if estimated_worker_count > MAXIMUM_WORKER_COUNT:
        return MAXIMUM_WORKER_COUNT
    return int(estimated_worker_count)


def fetch_worker_count() -> int:
    """Ask Heroku how many worker dynos are currently running.

    A sample response from the formation endpoint:

    {
        "app": {
            "id": "eb15dbfb-ad89-4736-8f65-054bffa98c53",
            "name": "buttondown"
        },
        "command": "bin/worker",
        "created_at": "2017-05-24T04:18:47Z",
        "id": "b614533b-1cf3-47db-b5eb-2054a0a1db7e",
        "type": "worker",
        "quantity": 5,
        "size": "Standard-1X",
        "updated_at": "2021-03-02T01:03:47Z"
    }
    """
    return requests.get(
        _url(settings.HEROKU_APP_NAME, HEROKU_API_ENDPOINT), headers=_headers()
    ).json()["quantity"]


def set_worker_count(desired_count: int):
    formation = {"quantity": desired_count, "size": DYNO_SIZE}
    requests.patch(
        _url(settings.HEROKU_APP_NAME, HEROKU_API_ENDPOINT),
        headers=_headers(),
        data=formation,
    )


def scale_workers():
    backlog_size = calculate_backlog_size(QUEUES_TO_CONSIDER_FOR_SCALING)
    logger.info(f"Sitting on a backlog of {backlog_size} items")

    appropriate_number_of_workers = calculate_worker_count(backlog_size)
    current_number_of_workers = fetch_worker_count()
    if appropriate_number_of_workers == current_number_of_workers:
        logger.info(
            f"Already at appropriate number of workers ({current_number_of_workers})"
        )
        return

    logger.info(
        f"Scaling from {current_number_of_workers} to {appropriate_number_of_workers}"
    )
    set_worker_count(appropriate_number_of_workers)


class WorkerScalerCronJob(CronJobBase):
    RUN_EVERY_MINS = 10
    schedule = Schedule(run_every_mins=RUN_EVERY_MINS)
    code = "emails.scale_workers"

    def do(self):
        scale_workers()
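One thing not shown above: the `_url` and `_headers` helpers that the Heroku calls lean on. They are just thin wrappers around the Platform API's URL scheme and required headers; a sketch of what they presumably look like, assuming the token lives in a `HEROKU_API_KEY` setting (a name I'm inventing for illustration), follows.

# Sketch of the helpers referenced above. The Accept header and bearer-token
# auth are what Heroku's Platform API (version 3) expects; `HEROKU_API_KEY`
# is an assumed settings name, not necessarily the real one.
def _url(app_name: str, path: str) -> str:
    return f"https://api.heroku.com/apps/{app_name}/{path}"


def _headers() -> dict:
    return {
        "Accept": "application/vnd.heroku+json; version=3",
        "Authorization": f"Bearer {settings.HEROKU_API_KEY}",
    }

(The cron job itself gets picked up in the usual django-cron way: the class's dotted path goes into CRON_CLASSES and something, a clock dyno or the Heroku Scheduler, runs python manage.py runcrons on an interval.)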
Really not that bad, all things considered! This has also been live for a week and has worked well. I still need to track metrics around the rollout — in particular, “time from email being sent out to time at which all emails have been delivered in the inbox” — but I haven’t been pestered about any overlong queues, and things seem pretty stable.
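When I do get around to that, most of the raw material is already sitting in rq: finished jobs keep their `enqueued_at` and `ended_at` timestamps. A rough sketch of the kind of number I would want to pull (the function name is made up, and "job finished" is only a proxy for "delivered to the inbox"):

from typing import List

import django_rq
from rq.job import Job


def drain_latencies(queue_name: str = "high") -> List[float]:
    # How long did each recently finished job on this queue sit between being
    # enqueued and finishing? Only a proxy for true delivery time, but it
    # captures how quickly a burst of emails drains.
    queue = django_rq.get_queue(queue_name)
    latencies = []
    for job_id in queue.finished_job_registry.get_job_ids():
        job = Job.fetch(job_id, connection=queue.connection)
        if job.enqueued_at and job.ended_at:
            latencies.append((job.ended_at - job.enqueued_at).total_seconds())
    return latencies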
Now that some of the low-hanging perf & scaling fruit is apparently behind me, I find myself in the fun and difficult position of choosing a next project. There is some bug bashing that needs to take place, but I think finally tackling multiple-newsletter creation in a less janky way than “log out, create a new newsletter, and paste that API key in” feels like a good candidate. (And no, of course that’s not because I received two support emails about it this morning. Why would you even ask that?)