March 7, 2021

Weeknotes from Buttondown

Last week, I mentioned that I was gearing up for a week of frustrating (and hopefully rewarding) Heroku spelunking: dusting off the proverbial (and I guess literal — I had to update a couple dependencies) toolbox and going to town.

…

It was a one-liner, I think.

Not even a satisfying one-liner, the kind of one-line change that you arrive at with a sense of exhausted satisfaction after many hours down in the Performance Mines: a stupid, idiotic one-liner, an “oh my god, of course it’s this issue, why didn’t I just look at this sooner”. The this in this case was a timeout set by gunicorn, the web server I run around Django — it was set to kill any request after two minutes, whereas Heroku has a hard timeout of thirty seconds.

So, in these rare cases where the server was overloaded or a particularly problematic endpoint was getting hit, Heroku would sever the connection after thirty seconds but my app would chug away for an additional ninety seconds, depriving the Heroku router of a connection and then just dying anyway. This would lead to cascading failures and me getting paged to spin up more hosts.
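
The shape of the fix, if you ever find yourself in the same spot, is just making sure gunicorn gives up before Heroku does:

# gunicorn.conf.py: the relevant setting (a sketch, not my literal config).
# Heroku's router abandons any request after 30 seconds, so there's no point
# letting gunicorn keep a worker tied up any longer than that.
timeout = 30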

Since noticing this issue and pushing the one-liner (technically a two-character-er, if we’re counting), I’ve had no such incidents. This could just be good luck — but I’m pretty sure it was the root cause. What’s more, I actually get information in Sentry about what’s timing out now as opposed to those timeouts falling into a black hole.

I know, I know. Better late than never. But I probably spent eight man-hours and a few thousand dollars, in aggregate, manually scaling the cluster up and down due to these timeouts, and I am more than a little beside myself that I didn’t notice or pay attention to this any earlier.

On the other side of the scaling equation, I decided to have a little bit of fun and roll my own autoscaling for workers. Worker load is legitimately spiky; the steady-state load is fairly low, punctuated by bursts of thousands of emails being sent out. Here is the full source of that file, for what it’s worth:

import logging
from typing import List

import django_rq
import requests
from django.conf import settings
from django_cron import CronJobBase, Schedule

# (`_url` and `_headers` are two small helpers that aren't shown here;
# a rough sketch of them follows the file.)

# Notably, `high` includes all outbound emails. We exclude `low`,
# which is used for subscriber imports, Slack notifications, and
# some email validations. (Yes, we should be more granular here.)
QUEUES_TO_CONSIDER_FOR_SCALING = ["high", "default"]

logger = logging.getLogger(__name__)

# Just so we don't starve ourselves too heavily.
MINIMUM_WORKER_COUNT = 3

# This is an arbitrary cutoff, but I'm worried about overloading Redis.
MAXIMUM_WORKER_COUNT = 50

# Another arbitrary number that I'll probably want to tinker with at some point.
# This number _should_ correspond with something concrete — how many events can
# we burn through in a given ten minute period, for instance — but that varies
# wildly from enqueued job to job. A longer term solution probably involves getting
# very granular with the queues that we use, so we can do things like say "okay,
# sending an email takes X sec, so we can process Y per worker-minute, and scale from there".
# But...that's a future problem, and we just want some basic scaling for now.
# This current value means that if we have a backlog of 10k, we scale to maximum, which seems right.
EVENTS_PER_WORKER = 100

# No clever way to handle this but I don't see it changing any time soon!
DYNO_SIZE = "standard-1x"

# This should correspond with `Procfile`.
DYNO_TYPE = "worker"

HEROKU_API_ENDPOINT = f"formation/{DYNO_TYPE}"


def calculate_backlog_size(queue_names: List[str]) -> int:
    backlog_size = 0
    for queue_name in queue_names:
        queue = django_rq.get_queue(queue_name)
        backlog_size += queue.count
    return backlog_size


def calculate_worker_count(backlog_size: int) -> int:
    estimated_worker_count = backlog_size / EVENTS_PER_WORKER
    if estimated_worker_count < MINIMUM_WORKER_COUNT:
        return MINIMUM_WORKER_COUNT
    if estimated_worker_count > MAXIMUM_WORKER_COUNT:
        return MAXIMUM_WORKER_COUNT
    return int(estimated_worker_count)


def fetch_worker_count() -> int:
    """Ask Heroku how many worker dynos we're currently running.

    A sample response from the formation endpoint, for reference:

      {
        "app": {
          "id": "eb15dbfb-ad89-4736-8f65-054bffa98c53",
          "name": "buttondown"
        },
        "command": "bin/worker",
        "created_at": "2017-05-24T04:18:47Z",
        "id": "b614533b-1cf3-47db-b5eb-2054a0a1db7e",
        "type": "worker",
        "quantity": 5,
        "size": "Standard-1X",
        "updated_at": "2021-03-02T01:03:47Z"
      }
    """
    return requests.get(
        _url(settings.HEROKU_APP_NAME, HEROKU_API_ENDPOINT), headers=_headers()
    ).json()["quantity"]


def set_worker_count(desired_count: int):
    formation = {"quantity": desired_count, "size": DYNO_SIZE}
    requests.patch(
        _url(settings.HEROKU_APP_NAME, HEROKU_API_ENDPOINT),
        headers=_headers(),
        data=formation,
    )


def scale_workers():
    backlog_size = calculate_backlog_size(QUEUES_TO_CONSIDER_FOR_SCALING)
    logger.info(f"Sitting on a backlog of {backlog_size} items")

    appropriate_number_of_workers = calculate_worker_count(backlog_size)
    current_number_of_workers = fetch_worker_count()
    if appropriate_number_of_workers == current_number_of_workers:
        logger.info(
            f"Already at appropriate number of workers ({current_number_of_workers})"
        )
        return

    logger.info(
        f"Scaling from {current_number_of_workers} to {appropriate_number_of_workers}"
    )
    set_worker_count(appropriate_number_of_workers)


class WorkerScalerCronJob(CronJobBase):
    RUN_EVERY_MINS = 10

    schedule = Schedule(run_every_mins=RUN_EVERY_MINS)
    code = "emails.scale_workers"

    def do(self):
        scale_workers()
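
The `_url` and `_headers` helpers aren’t shown above; they’re roughly this shape, as a sketch against the Heroku Platform API (the API-key setting name below is illustrative, not the real one):

def _url(app_name: str, path: str) -> str:
    # Everything in the Heroku Platform API hangs off of the app's namespace.
    return f"https://api.heroku.com/apps/{app_name}/{path}"


def _headers() -> dict:
    # `version=3` pins the current Platform API version; the token is just a
    # Heroku API key pulled out of Django settings (setting name illustrative).
    return {
        "Accept": "application/vnd.heroku+json; version=3",
        "Authorization": f"Bearer {settings.HEROKU_API_KEY}",
    }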

Really not that bad, all things considered! This has also been live for a week and has worked well. I still need to track metrics around the rollout — in particular, “time from an email being sent out to the time at which all emails have been delivered to the inbox” — but I haven’t been pestered about any overlong queues, and things seem pretty stable.
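
If I do end up instrumenting that, the metric itself is simple enough; a sketch, with entirely hypothetical names (none of this exists in the codebase yet):

from datetime import datetime
from typing import List


def delivery_lag_seconds(sent_at: datetime, delivered_ats: List[datetime]) -> float:
    # "Time from the email being sent out to the time at which all emails
    # have been delivered": the lag to the slowest delivery in the batch.
    return (max(delivered_ats) - sent_at).total_seconds()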

Now that some of the low-hanging perf & scaling fruit is apparently behind me, I find myself in the fun and difficult position of choosing a next project. There is some bug bashing that needs to take place, but I think finally tackling multiple-newsletter creation in a less janky way than “log out, create a new newsletter, and paste that API key in” feels like a good candidate. (And no, of course that’s not because I received two support emails about it this morning. Why would you even ask that?)

