Data onboarding
“Data onboarding” is a slightly bureaucratic name for a very boring and very important process: bringing data from Service A to Service B when a customer who has been using Service A joins Service B. You have probably done data onboarding many times in your life, whether it's uploading a list of contacts to a new social app or whatever.
Buttondown has two tranches of data onboarding:
- importing subscribers. This is, all things considered, pretty easy, thanks to the unimpeachable portability of email addresses. There are occasional edge cases (heavy use of metadata, esoteric paid subscriptions, that kind of thing), but Buttondown’s current solution here, a bespoke CSV parser with some bulk validation at the backend, is pretty good, and the number of times I have to do something truly manual is fairly low. (A sketch of the general shape of that parser follows this list.)
- importing archives. HTML, while portable, is very gross, and folks import archives at roughly one tenth the rate at which they import subscribers (according to an SQL query I hammered out just now). This is for a number of reasons: some people don’t care that much about bringing their archives over from a previous service; some people use a blog or other website as a ‘source of truth’ for their archives; and so on.
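For flavor, here's a minimal sketch of the shape of that subscriber path (this is not Buttondown's actual parser; the function and column names are invented): parse rows, validate each email, and collect the rejects for the user to fix.

```python
# Hypothetical sketch of a CSV subscriber import, not Buttondown's actual code.
import csv
import io
from email.utils import parseaddr


def parse_subscriber_csv(raw: str) -> tuple[list[dict], list[str]]:
    """Split a CSV export into valid subscriber rows and rejected lines."""
    valid, errors = [], []
    reader = csv.DictReader(io.StringIO(raw))
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        email = (row.get("email") or "").strip()
        # parseaddr is a cheap sanity check; real validation is stricter.
        if "@" in parseaddr(email)[1]:
            # Everything that isn't the email column becomes metadata.
            metadata = {k: v for k, v in row.items() if k != "email" and v}
            valid.append({"email": email, "metadata": metadata})
        else:
            errors.append(f"line {i}: unparseable email {email!r}")
    return valid, errors
```

The real thing layers bulk validation and paid-subscription handling on top, but the skeleton is about this small.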
The combination of “not many people do this” and “it’s very annoying to make this better” is generally a sign that I neglect whatever that thing is, and that heuristic holds true for archive imports. Some services (Substack, Wordpress, Mailchimp) have fairly standardized exports that include archives, which makes my importers for them pretty reasonable and durable. Some services — like, say, Revue — do not, and force me to crawl + parse them to get something approaching a reasonable result. Or, in Revue’s case, a pretty unreasonable result: Buttondown’s ability to import archives from Revue is frankly pretty bad.
Depending on how tuned in you are to the Twitter news cycle, you might see where this is going: Twitter is planning on shuttering Revue, and I have been flooded with requests to improve archive imports. So it goes.
This has actually been a pretty fun exercise in cutting a V2 of a feature. I’m going with a rules-based approach: crawl over every single node in a page, check whether it is a piece of content I care about, and then apply a transformation to it. Here’s a rough draft of what works (though it's incomplete: as I write this, I’m missing support for embedded links and blockquotes):
```python
from dataclasses import dataclass
from typing import Any, Callable

from bs4 import BeautifulSoup


@dataclass
class Rule:
    name: str
    test: Callable[[Any], bool]  # does this rule apply to the node?
    action: Callable[[Any], Any]  # transform the node into clean HTML


def transform_p(element: Any) -> str:
    element.name = "p"
    del element["style"]
    del element["class"]
    return element.__str__()


def transform_header(element: Any) -> str:
    element.name = "h2"
    return element.__str__()


def transform_ul(element: Any) -> str:
    del element["class"]
    return element.__str__()


# Each rule pairs a predicate ("does this node look like a piece of Revue
# content?") with a transformation that strips the Revue-specific styling.
RULES = [
    Rule("p", lambda element: "revue-p" in element.get("class", []), transform_p),
    Rule(
        "header",
        lambda element: "header-text" in element.get("class", [])
        or "revue-h2" in element.get("class", []),
        transform_header,
    ),
    Rule("ul", lambda element: "revue-ul" in element.get("class", []), transform_ul),
]


def extract_body(email_html: BeautifulSoup) -> str:
    # Walk every node in the document; any node that matches a rule gets
    # transformed and appended to the output.
    text = ""
    for element in email_html.select("*"):
        for rule in RULES:
            if rule.test(element):
                text += rule.action(element) + "\n\n"
    return text
```
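To sanity-check the rules, here's what extract_body produces when pointed at a hand-mocked fragment in the style of a Revue archive page (the markup is illustrative, not a real export):

```python
from bs4 import BeautifulSoup

# A hand-mocked fragment in the style of a Revue archive page.
raw = """
<div>
  <h2 class="revue-h2">Data onboarding</h2>
  <p class="revue-p" style="margin:0">Bringing data from Service A to Service B.</p>
  <ul class="revue-ul"><li>subscribers</li><li>archives</li></ul>
  <table class="layout"><tr><td>chrome the rules ignore</td></tr></table>
</div>
"""

print(extract_body(BeautifulSoup(raw, "html.parser")))
# <h2 class="revue-h2">Data onboarding</h2>
#
# <p>Bringing data from Service A to Service B.</p>
#
# <ul><li>subscribers</li><li>archives</li></ul>
```

(Note that the draft leaves Revue's class on the header, unlike transform_p; one more item for the to-do list.)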
I’m also taking the opportunity to spin up a little debug view that makes it easier to run side-by-side comparisons between the ‘real’ content and Buttondown’s parsed version of it.
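For the curious, such a view can be barely more than two iframes side by side. The sketch below imagines it as a Django view; the model and URL wiring are invented for illustration, and extract_body is the function from above.

```python
# A minimal sketch of the kind of view I mean, imagined as a Django view;
# the model and field names here are invented for illustration.
from html import escape

from bs4 import BeautifulSoup
from django.http import HttpResponse

TEMPLATE = """
<style>iframe { width: 48vw; height: 90vh; }</style>
<iframe srcdoc="%(raw)s"></iframe>
<iframe srcdoc="%(parsed)s"></iframe>
"""


def compare_import(request, email_id):
    email = ImportedEmail.objects.get(id=email_id)  # hypothetical model
    parsed = extract_body(BeautifulSoup(email.raw_html, "html.parser"))
    # srcdoc wants entity-escaped HTML; html.escape also escapes the quotes.
    return HttpResponse(
        TEMPLATE % {"raw": escape(email.raw_html), "parsed": escape(parsed)}
    )
```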
Why do I not make more of these views? I do not know. Every time I take the fifteen minutes to spin one up, it yields a tremendous result in terms of efficiency and flow.
Anyhoo: this was a fairly quiet week. I mentioned last week that my thoughts are turning to operational efficiency & documentation, and I’m starting to slowly populate a runbook and spin up a HelpScout account (long-time readers may scoff at this, remembering me spending a year on ZenDesk with nothing really to show for it). That’s going well. Sales were very strong due to this whole Revue fracas (I have written a sentence along the lines of “it turns out a very strong value proposition is ‘pay me money and I will keep your tool online & functional’” at least ten times).
More next week about the HelpScout onboarding process, and about how I’ve tricked myself into already thinking about a new billing model.