March 7, 2021

Weeknotes from Buttondown

Last week, I mentioned that I was gearing up for a week of frustrating (and hopefully rewarding) Heroku spelunking: dusting off the proverbial (and I guess literal — I had to update a couple dependencies) toolbox and going to town.

…

It was a one-liner, I think.

Not even a satisfying one-liner, the kind of one-line change that you arrive at with a sense of exhausted satisfaction after many hours down in the Performance Mines: a stupid, idiotic one-liner, an “oh my god, of course it’s this issue, why didn’t I just look at this sooner”. The this in this case was a timeout set by gunicorn, the web server I run around Django — it was set to kill any request after two minutes, whereas Heroku has a hard timeout of thirty seconds.

So, in these rare cases where the server was overloaded or a particularly problematic endpoint was getting hit, Heroku would sever the connection after thirty seconds but my app would chug away for an additional ninety seconds, depriving the Heroku router of a connection and then just dying anyway. This would lead to cascading failures and me getting paged to spin up more hosts.
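
The shape of the fix, if you ever find yourself in the same spot, is just making sure gunicorn gives up before Heroku does:

# gunicorn.conf.py: the relevant setting (a sketch, not my literal config).
# Heroku's router abandons any request after 30 seconds, so there's no point
# letting gunicorn keep a worker tied up any longer than that.
timeout = 30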

Since noticing this issue and pushing the one-liner (technically a two-character-er, if we’re counting), I’ve had no such incidents. This could just be good luck — but I’m pretty sure it was the root cause. What’s more, I actually get information in Sentry about what’s timing out now as opposed to those timeouts falling into a black hole.

I know, I know. Better late than never. But I probably spent eight man-hours and a few thousand dollars, in aggregate, manually scaling the cluster up and down due to these timeouts, and I am more than a little beside myself that I didn’t notice or pay attention to this any earlier.

On the other side of the scaling equation, I decided to have a little bit of fun and roll my own autoscaling for workers. Worker load is legitimately spiky; the steady-state load is fairly low, punctuated by bursts of thousands of emails being sent out. Here is the full source of that file, for what it’s worth:

import logging
from typing import List

import django_rq
import requests
from django.conf import settings
from django_cron import CronJobBase, Schedule

# (`_url` and `_headers` are two small helpers that aren't shown here;
# a rough sketch of them follows the file.)

# Notably, `high` includes all outbound emails. We exclude `low`,
# which is used for subscriber imports, Slack notifications, and
# some email validations. (Yes, we should be more granular here.)
QUEUES_TO_CONSIDER_FOR_SCALING = ["high", "default"]

logger = logging.getLogger(__name__)

# Just so we don't starve ourselves too heavily.
MINIMUM_WORKER_COUNT = 3

# This is an arbitrary cutoff, but I'm worried about overloading Redis.
MAXIMUM_WORKER_COUNT = 50

# Another arbitrary number that I'll probably want to tinker with at some point.
# This number _should_ correspond with something concrete — how many events can
# we burn through in a given ten minute period, for instance — but that varies
# wildly from enqueued job to job. A longer term solution probably involves getting
# very granular with the queues that we use, so we can do things like say "okay,
# sending an email takes X sec, so we can process Y per worker-minute, and scale from there".
# But...that's a future problem, and we just want some basic scaling for now.
# This current value means that if we have a backlog of 10k, we scale to maximum, which seems right.
EVENTS_PER_WORKER = 100

# No clever way to handle this but I don't see it changing any time soon!
DYNO_SIZE = "standard-1x"

# This should correspond with `Procfile`.
DYNO_TYPE = "worker"

HEROKU_API_ENDPOINT = f"formation/{DYNO_TYPE}"


def calculate_backlog_size(queue_names: List[str]) -> int:
    backlog_size = 0
    for queue_name in queue_names:
        queue = django_rq.get_queue(queue_name)
        backlog_size += queue.count
    return backlog_size


def calculate_worker_count(backlog_size: int) -> int:
    estimated_worker_count = backlog_size / EVENTS_PER_WORKER
    if estimated_worker_count < MINIMUM_WORKER_COUNT:
        return MINIMUM_WORKER_COUNT
    if estimated_worker_count > MAXIMUM_WORKER_COUNT:
        return MAXIMUM_WORKER_COUNT
    return int(estimated_worker_count)


def fetch_worker_count() -> int:
    """Ask Heroku how many worker dynos we're currently running.

    A sample response from the formation endpoint, for reference:

      {
        "app": {
          "id": "eb15dbfb-ad89-4736-8f65-054bffa98c53",
          "name": "buttondown"
        },
        "command": "bin/worker",
        "created_at": "2017-05-24T04:18:47Z",
        "id": "b614533b-1cf3-47db-b5eb-2054a0a1db7e",
        "type": "worker",
        "quantity": 5,
        "size": "Standard-1X",
        "updated_at": "2021-03-02T01:03:47Z"
      }
    """
    return requests.get(
        _url(settings.HEROKU_APP_NAME, HEROKU_API_ENDPOINT), headers=_headers()
    ).json()["quantity"]


def set_worker_count(desired_count: int):
    formation = {"quantity": desired_count, "size": DYNO_SIZE}
    requests.patch(
        _url(settings.HEROKU_APP_NAME, HEROKU_API_ENDPOINT),
        headers=_headers(),
        data=formation,
    )


def scale_workers():
    backlog_size = calculate_backlog_size(QUEUES_TO_CONSIDER_FOR_SCALING)
    logger.info(f"Sitting on a backlog of {backlog_size} items")

    appropriate_number_of_workers = calculate_worker_count(backlog_size)
    current_number_of_workers = fetch_worker_count()
    if appropriate_number_of_workers == current_number_of_workers:
        logger.info(
            f"Already at appropriate number of workers ({current_number_of_workers})"
        )
        return

    logger.info(
        f"Scaling from {current_number_of_workers} to {appropriate_number_of_workers}"
    )
    set_worker_count(appropriate_number_of_workers)


class WorkerScalerCronJob(CronJobBase):
    RUN_EVERY_MINS = 10

    schedule = Schedule(run_every_mins=RUN_EVERY_MINS)
    code = "emails.scale_workers"

    def do(self):
        scale_workers()
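
The `_url` and `_headers` helpers aren’t shown above; they’re roughly this shape, as a sketch against the Heroku Platform API (the API-key setting name below is illustrative, not the real one):

def _url(app_name: str, path: str) -> str:
    # Everything in the Heroku Platform API hangs off of the app's namespace.
    return f"https://api.heroku.com/apps/{app_name}/{path}"


def _headers() -> dict:
    # `version=3` pins the current Platform API version; the token is just a
    # Heroku API key pulled out of Django settings (setting name illustrative).
    return {
        "Accept": "application/vnd.heroku+json; version=3",
        "Authorization": f"Bearer {settings.HEROKU_API_KEY}",
    }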

Really not that bad, all things considered! This has also been live for a week and has worked well. I still need to track metrics around the rollout — in particular, “time from an email being sent out to the time at which all emails have been delivered to the inbox” — but I haven’t been pestered about any overlong queues, and things seem pretty stable.
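
If I do end up instrumenting that, the metric itself is simple enough; a sketch, with entirely hypothetical names (none of this exists in the codebase yet):

from datetime import datetime
from typing import List


def delivery_lag_seconds(sent_at: datetime, delivered_ats: List[datetime]) -> float:
    # "Time from the email being sent out to the time at which all emails
    # have been delivered": the lag to the slowest delivery in the batch.
    return (max(delivered_ats) - sent_at).total_seconds()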

Now that some of the low-hanging perf & scaling fruit is apparently behind me, I find myself in the fun and difficult position of choosing a next project. There is some bug bashing that needs to take place, but I think finally tackling multiple-newsletter creation in a less janky way than “log out, create a new newsletter, and paste that API key in” feels like a good candidate. (And no, of course that’s not because I received two support emails about it this morning. Why would you even ask that?)

