March 7, 2021
Last week, I mentioned that I was gearing up for a week of frustrating (and hopefully rewarding) Heroku spelunking: dusting off the proverbial (and I guess literal — I had to update a couple dependencies) toolbox and going to town.
…
It was a one-liner, I think.
Not even a satisfying one-liner, the kind of one-line change you arrive at with a sense of exhausted satisfaction after many hours down in the Performance Mines: a stupid, idiotic one-liner, an "oh my god, of course it's this issue, why didn't I just look at this sooner" one-liner. The "this" in this case was a timeout set by gunicorn, the web server I run in front of Django: it was set to kill any request after two minutes, whereas Heroku's router has a hard timeout of thirty seconds.
So, in these rare cases where the server was overloaded or a particularly problematic endpoint was getting hit, Heroku would sever the connection after thirty seconds but my app would chug away for an additional ninety seconds, depriving the Heroku router of a connection and then just dying anyway. This would lead to cascading failures and me getting paged to spin up more hosts.
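The fix, of course, is just making sure gunicorn gives up before the router does. In gunicorn-config terms it amounts to something like the following; the specific number here is illustrative rather than the actual change, but anything at or below thirty seconds closes the gap:

# gunicorn.conf.py (a sketch, not the real diff)
# Heroku's router abandons a request after 30 seconds, so there's no point
# letting a worker keep chugging past that. Kill it first.
timeout = 28  # illustrative value; must be <= Heroku's 30-second router timeout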
Since noticing this issue and pushing the one-liner (technically a two-character-er, if we’re counting), I’ve had no such incidents. This could just be good luck — but I’m pretty sure it was the root cause. What’s more, I actually get information in Sentry about what’s timing out now as opposed to those timeouts falling into a black hole.
I know, I know. Better late than never. But I probably spent eight man-hours and a few thousand dollars in aggregate manually scaling the cluster up and down because of these timeouts, and I'm more than a little beside myself that I didn't notice or pay attention to this sooner.
On the other side of the scaling equation, I decided to have a little bit of fun and roll my own autoscaling for workers. Worker load is legitimately spiky; the steady-state load is fairly low, punctuated by bursts of thousands of emails being sent out. Here is the full source of that file, for what it's worth:
# Imports reconstructed from how the module is used below; the originals
# aren't shown in the snippet. (`_url` and `_headers` are small helpers from
# elsewhere in the codebase and aren't reproduced here.)
import logging
from typing import List

import django_rq
import requests
from django.conf import settings
from django_cron import CronJobBase, Schedule

# Notably, `high` includes all outbound emails. We exclude `low`,
# which is used for subscriber imports, Slack notifications, and
# some email validations. (Yes, we should be more granular here.)
QUEUES_TO_CONSIDER_FOR_SCALING = ["high", "default"]

logger = logging.getLogger(__name__)

# Just so we don't starve ourselves too heavily.
MINIMUM_WORKER_COUNT = 3

# This is an arbitrary cutoff, but I'm worried about overloading Redis.
MAXIMUM_WORKER_COUNT = 50

# Another arbitrary number that I'll probably want to tinker with at some point.
# This number _should_ correspond with something concrete — how many events can
# we burn through in a given ten minute period, for instance — but that varies
# wildly from enqueued job to job. A longer term solution probably involves getting
# very granular with the queues that we use, so we can do things like say "okay,
# sending an email takes X sec, so we can process Y per worker-minute, and scale from there".
# But...that's a future problem, and we just want some basic scaling for now.
# This current value means that if we have a backlog of 10k, we scale to maximum, which seems right.
EVENTS_PER_WORKER = 100

# No clever way to handle this but I don't see it changing any time soon!
DYNO_SIZE = "standard-1x"

# This should correspond with `Procfile`.
DYNO_TYPE = "worker"

HEROKU_API_ENDPOINT = f"formation/{DYNO_TYPE}"


def calculate_backlog_size(queue_names: List[str]) -> int:
    backlog_size = 0
    for queue_name in queue_names:
        queue = django_rq.get_queue(queue_name)
        backlog_size += queue.count
    return backlog_size


def calculate_worker_count(backlog_size: int) -> int:
    estimated_worker_count = backlog_size / EVENTS_PER_WORKER
    if estimated_worker_count < MINIMUM_WORKER_COUNT:
        return MINIMUM_WORKER_COUNT
    if estimated_worker_count > MAXIMUM_WORKER_COUNT:
        return MAXIMUM_WORKER_COUNT
    return int(estimated_worker_count)


def fetch_worker_count() -> int:
    """
    {
        "app": {
            "id": "eb15dbfb-ad89-4736-8f65-054bffa98c53",
            "name": "buttondown"
        },
        "command": "bin/worker",
        "created_at": "2017-05-24T04:18:47Z",
        "id": "b614533b-1cf3-47db-b5eb-2054a0a1db7e",
        "type": "worker",
        "quantity": 5,
        "size": "Standard-1X",
        "updated_at": "2021-03-02T01:03:47Z"
    }
    """
    return requests.get(
        _url(settings.HEROKU_APP_NAME, HEROKU_API_ENDPOINT), headers=_headers()
    ).json()["quantity"]


def set_worker_count(desired_count: int):
    formation = {"quantity": desired_count, "size": DYNO_SIZE}
    requests.patch(
        _url(settings.HEROKU_APP_NAME, HEROKU_API_ENDPOINT),
        headers=_headers(),
        data=formation,
    )


def scale_workers():
    backlog_size = calculate_backlog_size(QUEUES_TO_CONSIDER_FOR_SCALING)
    logger.info(f"Sitting on a backlog of {backlog_size} items")
    appropriate_number_of_workers = calculate_worker_count(backlog_size)
    current_number_of_workers = fetch_worker_count()
    if appropriate_number_of_workers == current_number_of_workers:
        logger.info(
            f"Already at appropriate number of workers ({current_number_of_workers})"
        )
        return
    logger.info(
        f"Scaling from {current_number_of_workers} to {appropriate_number_of_workers}"
    )
    set_worker_count(appropriate_number_of_workers)


class WorkerScalerCronJob(CronJobBase):
    RUN_EVERY_MINS = 10

    schedule = Schedule(run_every_mins=RUN_EVERY_MINS)
    code = "emails.scale_workers"

    def do(self):
        scale_workers()
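(For anyone following along at home with django-cron: a job class like this also has to be registered in settings, and something, whether Heroku Scheduler, a clock dyno, or whatever else, needs to invoke `python manage.py runcrons` periodically. Roughly like so, with a made-up dotted path standing in for wherever the file actually lives:)

# settings.py (the dotted path is a stand-in, not the real module path)
CRON_CLASSES = [
    "emails.jobs.WorkerScalerCronJob",
]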
Really not that bad, all things considered! This has also been live for a week and has worked well. I still need to track metrics around the rollout (in particular, "time from an email being sent out to the time at which every copy has been delivered to the inbox"), but I haven't been pestered about any overlong queues and things seem pretty stable.
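In the meantime, the cheapest proxy I can think of for that metric is the age of the oldest job sitting in the scaling queues. It isn't the full send-to-inbox number, but it would at least flag an overlong queue before a customer does. A rough sketch, none of which exists in the codebase yet:

# Proxy metric: age of the oldest enqueued job across the same queues the
# scaler watches. (Walking `queue.jobs` isn't free on a big backlog, so this
# would want to be sampled on a schedule, not run constantly.)
from datetime import datetime, timezone
from typing import List

import django_rq


def oldest_backlog_age_in_seconds(queue_names: List[str]) -> float:
    oldest_age = 0.0
    now = datetime.now(timezone.utc)
    for queue_name in queue_names:
        for job in django_rq.get_queue(queue_name).jobs:
            enqueued_at = job.enqueued_at
            if enqueued_at is None:
                continue
            # Older RQ versions store naive UTC datetimes here.
            if enqueued_at.tzinfo is None:
                enqueued_at = enqueued_at.replace(tzinfo=timezone.utc)
            oldest_age = max(oldest_age, (now - enqueued_at).total_seconds())
    return oldest_age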
Now that some of the low-hanging perf & scaling fruit is apparently behind me, I find myself in the fun and difficult position of choosing a next project. There is some bug bashing that needs to take place, but finally tackling multiple-newsletter creation in a less janky way than "log out, create a new newsletter, and paste that API key in" feels like a good candidate. (And no, of course that's not because I received two support emails about it this morning. Why would you even ask that?)