Declarative messaging

A framework and pattern for designing communication systems.

In 2019, Stripe needed to send some new emails. So I was there :)

A lot of what I find interesting about working at Stripe is our focus on users and how we engage with them. To people not on Stripe, we want to be seen as moving quickly, innovating, and building and maintaining trustworthy, reliable systems. To people on Stripe, we want that and to be seen as taking away major problems of running a business.

"Redesigning a plane while it is in flight" is the Stripe day to day. Laws and regulations change, payment methods die and are born again, and tax law is tax law. The value prop of Stripe is not needing to know this, as a user.

What this means for communicating with users is that reaching out and making them do anything is not great. So we don't send a lot of emails, and it is no one's goal to send a lot of emails. I'm a fan and strong advocate of not wasting people's time, and that is reflected in our messaging.

So, anyway, we did need to send some (okay, a lot of) emails. Better make it good, then!

State of the art

We'd sent lots of emails prior to 2019, so there's tons of history. What we were looking to build next was a multi-step email campaign, synced with messaging via notifications in the Dashboard and even SMS. Based on the actions people took, or didn't take in time, we would send more emails or change account functionality (very sensitive!).

Previous campaigns were sent manually and tracked in spreadsheets, run as inefficient offline data migrations, or built on some convoluted flow of emitting events based on actions and heartbeats on a schedule. The last option was particularly gnarly as it would "infect" the core business logic and, being sprinkled throughout the code, was hard to reason about and change.

Coming from a web development background I thought, "Is this Backbone?" Event listeners/emitters all over, state tracking in a whole bunch of places, no lifecycle management, and terribly complex testing.

I wanted to bring some clarity to this madness for this next set of campaigns, and future campaigns. Lots of one-off frameworks cropped up over the years, and I reviewed them all to see what worked and what didn't. Ultimately, each system devolved to optimizing a single use case or set of use cases the supporting team had. What could I glean to make something general purpose for everyone to be able to use?

I wanted something testable. These campaigns would be serious business and while there's always a constant push in the messaging space for going more and more no-code in the tooling, having systems in code lets us write tests. I wanted the tests for campaigns to be easy to write and comprehensive. More precisely, since the business logic itself could not be reduced further (complex things are complex), making the communication part of the problem easier was my aim.

I've found time and time again that designing something testable yields something simple. In some ways, it is a matter of incentives: if it is easier to write tests, more tests will be written. If it's easy to write tests, that means there's not much ceremony (e.g. no FactoryFactories) or boilerplate. If there's not much boilerplate, it means we're working with the good stuff: simple functions and data.

React for messaging

Backbone was replaced with React, so I went on an API design journey to build a React, but for communication systems. The central premise I built on was, "What should the user be looking at right now?" Based on external state, we'd have a function returning messaging as data:

function render(state) {
  if (state.userHasUpdatedCompanyOwnership) {
    return messaging({ status: 'finished' })
  }

  return messaging({
    status: 'needs_to_update',
    email: {
      key: 'reminder',
      subject: '[Action required] Update company ownership',
      // ...
    },
  })
}

This is a simple example. We are trying to have a user update their account info. If they have done that, we have nothing to show them. If they haven't, they should get an email.

The key is required and used for idempotency. Sending the email will be recorded in the database and subsequent emails matching that key won't be sent. This is similar to React's DOM diffing, applying only what has changed by comparing the old and new render tree.

The status is also required but is more of a convenience label for observability into the campaign. We always want to track the progression of campaigns, and use the status to aggregate users into cohorts.

But render is totally user-defined: you can call it a million times and it will do nothing but return plain data. This is perfect for tests and leaves the actual messaging to the framework, which provides two methods:

  • applyCommunications({campaignType, userId, messaging}) to diff against the current state and trigger emails and other communications if necessary.
  • getCommunications({campaignType, userId}) to receive the trail of sent messages. This could be used to inform the external state passed to render.

The campaignType is a way of isolating campaigns from one another.
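
To make the diffing concrete, here is a minimal in-memory sketch of those two methods. Only the two method names and their `campaignType`, `userId`, and `messaging` parameters come from the description above. The `Map`-backed store, the `deliver` callback, the per-channel key comparison, and the shape of the trail entries are all assumptions made for illustration, not the actual implementation.

```javascript
// Hypothetical in-memory stand-in for the database table of sent messages.
const sentLog = new Map(); // "campaignType:userId" -> array of trail entries

const CHANNELS = ['email', 'text_message'];

// Returns the trail of messages already recorded for this user and campaign.
function getCommunications({ campaignType, userId }) {
  return sentLog.get(`${campaignType}:${userId}`) ?? [];
}

// Diffs the rendered messaging against the trail and delivers only what is new.
// `deliver` is an assumed callback standing in for real email/SMS senders.
function applyCommunications({ campaignType, userId, messaging, deliver }) {
  const trail = getCommunications({ campaignType, userId });
  const delivered = [];

  for (const channel of CHANNELS) {
    const message = messaging[channel];
    if (!message) continue;

    // The key is the idempotency token: a key already in the trail is skipped.
    const alreadySent = trail.some(
      (entry) => entry.channel === channel && entry.key === message.key
    );
    if (alreadySent) continue;

    deliver(channel, message);
    delivered.push({ channel, key: message.key, status: messaging.status });
  }

  sentLog.set(`${campaignType}:${userId}`, [...trail, ...delivered]);
  return delivered;
}

// Usage: applying the same rendered output twice delivers the email once.
const outbox = [];
const rendered = {
  status: 'needs_to_update',
  email: { key: 'reminder', subject: '[Action required] Update company ownership' },
};
const args = {
  campaignType: 'company_ownership',
  userId: 'user_1',
  messaging: rendered,
  deliver: (channel, message) => outbox.push(message.key),
};
applyCommunications(args);
applyCommunications(args);
```

Because the trail is stored under both `campaignType` and `userId`, a second campaign reusing the key `reminder` would not be suppressed by this one.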

Using this framework has many benefits:

  • Testing is easy and fast. The render function contains just business logic and returns data. So tests didn't need to verify what would happen end-to-end and were simple to construct and decoupled from database and network reads/writes. Developers could trust the underlying framework to do the right thing and not re-test those internals.

  • Supporting multiple channels is easy. In the example above, we have an email: slot. The framework's design easily accommodates other channels, like text messaging:

    messaging({
      status: 'urgently_needs_to_update',
      text_message: {
        key: 'urgent_alert',
        text: 'Yo, like seriously, you gotta update your account.',
      },
    })
    

    One cool limitation of this particular interface, which I love, is that you can't send two emails about a campaign at once. By only having single email: and text_message: slots, we have guarded against the bad practice of sending conflicting or too many messages at each step in a campaign.

  • Campaigns are described in one place. The render function tells the whole story of what contributes to what a user is seeing. The calculation of state could be super complicated but the communications stay simple.

  • Functions provide the composition. Helpers and utilities building around plain functions and data are the easiest to write. We can build an ecosystem of reusable components. To break up a complex system, we break it into more render functions.

  • Good baseline observability. The status and the paper trail of messages sent by a campaign to a user allow us to provide tooling and dashboards for free, showing where a campaign is and detailed breakdowns for individual users. Each campaign used to require manual work to track itself, but now all someone needs is the campaignType to start digging.
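
The testing and composition points can be sketched together. In this hedged example, the `messaging` helper's validation, the `firstMatch` combinator, the stage function names, and the `daysUntilDeadline` field are all hypothetical; the article only establishes that render is a plain function from state to data and that status and key are required.

```javascript
// Assumed helper: the article does not show how `messaging` is implemented.
function messaging(fields) {
  if (!fields.status) throw new Error('messaging requires a status');
  for (const channel of ['email', 'text_message']) {
    if (fields[channel] && !fields[channel].key) {
      throw new Error(`${channel} requires a key`);
    }
  }
  return fields;
}

// Small render functions, one per stage. Returning null means "not my stage".
function renderFinished(state) {
  return state.userHasUpdatedCompanyOwnership
    ? messaging({ status: 'finished' })
    : null;
}

function renderUrgent(state) {
  return state.daysUntilDeadline <= 3
    ? messaging({
        status: 'urgently_needs_to_update',
        text_message: {
          key: 'urgent_alert',
          text: 'Yo, like seriously, you gotta update your account.',
        },
      })
    : null;
}

function renderReminder() {
  return messaging({
    status: 'needs_to_update',
    email: {
      key: 'reminder',
      subject: '[Action required] Update company ownership',
    },
  });
}

// Composition is ordinary function composition: the first matching stage wins.
function firstMatch(...stages) {
  return (state) => {
    for (const stage of stages) {
      const result = stage(state);
      if (result) return result;
    }
    throw new Error('no stage matched');
  };
}

const render = firstMatch(renderFinished, renderUrgent, renderReminder);

// A test is a function call and a comparison, with no database or network.
console.log(render({ userHasUpdatedCompanyOwnership: true }).status); // finished
console.log(render({ daysUntilDeadline: 2 }).status); // urgently_needs_to_update
console.log(render({ daysUntilDeadline: 30 }).status); // needs_to_update
```

Each stage can be tested on its own, and the composed `render` is still just a function of state, so the same checks work at either level.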

This is fundamentally what was built, but it's worth calling out some complexities. There are things to consider beyond what's described here:

  • When is render called? A framework like React controls the state and provides setState so it can see when it changes. In a backend, you could build something to watch the database, I suppose. In practice we use events to trigger render or, if the passage of time factors in, trigger render via a cron job. We encapsulated rendering and applying the messages together to prevent the mistake of calling the wrong render for a campaign, while still keeping render testable in isolation.

    In general, being agnostic here allows campaign owners to expend the exact amount of compute they feel necessary to run a good campaign.

  • State is more permanent. In a front-end framework, if you screw up in your render method, the fix is to refresh the page. For messaging, state is much more long-lived, and for most messaging channels you can't scrape those messages out of people's inboxes once sent.

    That makes testing, and making regular changes safe and easy, all the more important to a successful project. Prioritizing observability, metrics, detectors, kill switches, etc. is prudent for safe operation. Luckily, much of this can be provided by the framework and common tooling.
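
One way to picture the encapsulation and the kill switch mentioned above is a small wrapper that binds a campaign's render to its campaignType. This is a sketch under assumptions: `defineCampaign`, `loadState`, `isEnabled`, and the handler names are hypothetical, and `applyCommunications` is passed in as a stub so the example stays self-contained.

```javascript
// Hypothetical wrapper: binds render to its campaignType so callers cannot
// pair the wrong render with the wrong campaign.
function defineCampaign({ campaignType, loadState, render, applyCommunications, isEnabled }) {
  return {
    render, // still exposed so tests can call it in isolation
    run(userId) {
      // Kill switch: checked on every run so a bad campaign can be halted.
      if (!isEnabled(campaignType)) return { skipped: true };

      const messaging = render(loadState(userId));
      applyCommunications({ campaignType, userId, messaging });
      return { skipped: false, status: messaging.status };
    },
  };
}

// Usage with stubs in place of the real state loader and framework.
const applied = [];
let enabled = true;

const campaign = defineCampaign({
  campaignType: 'company_ownership',
  loadState: (userId) => ({
    userHasUpdatedCompanyOwnership: userId === 'user_done',
  }),
  render: (state) =>
    state.userHasUpdatedCompanyOwnership
      ? { status: 'finished' }
      : { status: 'needs_to_update', email: { key: 'reminder' } },
  applyCommunications: (call) => applied.push(call),
  isEnabled: () => enabled,
});

// An event handler and a scheduled sweep both funnel into the same entry point.
const onOwnershipUpdated = (event) => campaign.run(event.userId);
const nightlySweep = (userIds) => userIds.map((id) => campaign.run(id));
```

How often `run` is called stays up to the campaign owner: once per event, on a schedule, or both. Repeated calls are safe as long as applying communications is idempotent by key.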

Finally, I'll leave you with the coolest thing about this design: dark testing. We encode what the user should be seeing right now, but oftentimes a messaging campaign's greatest enemy is assuming we actually modeled our system correctly.

We have a stateless render at our disposal, so we can have tons of fun simulating campaigns before they happen, against production data. We run them in the dark against users, without showing or sending those users any messaging, to gain confidence in how production will behave.

You can build this simulation tooling in many ways, but when all you have is engineers stringing complex messaging flows together by hand, no one is going to build this out as they go. Designed this way, it's simply yet another feature of the framework and its tooling.
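
As a rough illustration of a dark run, the sketch below calls render for every user's state and tallies what would happen, without applying anything. The `simulateCampaign` name, the report shape, and the sample states are assumptions for the example. It also ignores the trail of already-sent messages, so its `wouldSend` counts are an upper bound; a fuller version would diff against getCommunications.

```javascript
// Dry run: call render for each user's state and tally the outcome.
// Nothing is applied, so no message can reach a user.
function simulateCampaign(render, statesByUserId) {
  const report = { byStatus: {}, wouldSend: {}, errors: [] };

  for (const [userId, state] of Object.entries(statesByUserId)) {
    let messaging;
    try {
      messaging = render(state);
    } catch (error) {
      // A render that throws on real data is what a dark run should surface.
      report.errors.push({ userId, message: error.message });
      continue;
    }

    report.byStatus[messaging.status] = (report.byStatus[messaging.status] ?? 0) + 1;

    for (const channel of ['email', 'text_message']) {
      if (!messaging[channel]) continue;
      const label = `${channel}:${messaging[channel].key}`;
      report.wouldSend[label] = (report.wouldSend[label] ?? 0) + 1;
    }
  }

  return report;
}

// Usage with a render like the earlier example and made-up states.
function renderOwnership(state) {
  if (state.userHasUpdatedCompanyOwnership) return { status: 'finished' };
  return { status: 'needs_to_update', email: { key: 'reminder' } };
}

const report = simulateCampaign(renderOwnership, {
  user_1: { userHasUpdatedCompanyOwnership: true },
  user_2: { userHasUpdatedCompanyOwnership: false },
  user_3: { userHasUpdatedCompanyOwnership: false },
  user_4: null, // malformed state the model did not anticipate
});

console.log(report.byStatus); // { finished: 1, needs_to_update: 2 }
console.log(report.wouldSend); // { 'email:reminder': 2 }
console.log(report.errors.length); // 1
```

The unexpected `null` state shows up as an error in the report instead of as a broken email in someone's inbox, which is the kind of modeling mistake a dark run is meant to catch.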

I get really excited about this style of encoding messaging systems. They can be super expressive and easy to test, enabling good separation of concerns and some sick tooling. I haven't come across this elsewhere since I built it up in 2019, so I wanted to share!