A framework and pattern for designing communication systems.
In 2019, Stripe needed to send some new emails. So I was there :)
A lot of what I find interesting about working at Stripe is our focus on users and how we engage with them. To people not on Stripe, we want to be seen as moving quickly, innovating, and building and maintaining trustworthy, reliable systems. To people on Stripe, we want that and to be seen as solving major problems in running a business.
"Redesigning a plane while it is in flight" is the Stripe day to day. Laws and regulations change, payment methods die and are born again, and tax law is tax law. The value prop of Stripe is not needing to know this, as a user.
What this means for communicating with users is that reaching out and making them do anything is not great. So we don't send a lot of emails, and it is no one's goal to send a lot of emails. I'm a fan and strong advocate of not wasting people's time, and that is reflected in our messaging.
So, anyway, we did need to send some (okay, a lot of) emails. Better make it good, then!
We'd sent lots of emails prior to 2019, so there's tons of history. What we were looking to build next was a multi-step email campaign, synced with messaging via notifications in the Dashboard and even SMS. Based on the actions people took, or didn't take in time, we would send more emails or change account functionality (very sensitive!).
Previous campaigns were sent manually and tracked in spreadsheets, run as inefficient, offline data migrations, or used some convoluted flow of emitting events based on actions and heartbeats on a schedule. The last option was particularly gnarly as it would "infect" the core business logic, and being sprinkled about made it hard to reason about and change.
Coming from a web development background I thought, "Is this Backbone?" Event listeners/emitters all over, state tracking in a whole bunch of places, no lifecycle management, and terribly complex testing.
I wanted to bring some clarity to this madness for this next set of campaigns, and future campaigns. Lots of one-off frameworks cropped up over the years, and I reviewed them all to see what worked and what didn't. Ultimately, each system devolved to optimizing a single use case or set of use cases the supporting team had. What could I glean to make something general purpose for everyone to be able to use?
I wanted something testable. These campaigns would be serious business and while there's always a constant push in the messaging space for going more and more no-code in the tooling, having systems in code lets us write tests. I wanted the tests for campaigns to be easy to write and comprehensive. More precisely, since the business logic itself could not be reduced further (complex things are complex), making the communication part of the problem easier was my aim.
I've found time and time again that designing something testable yields something simple. In some ways, it is a matter of incentives: if it is easier to write tests, more tests will be written. If it's easy to write tests, that means there's not much ceremony (e.g. no FactoryFactories) or boilerplate. If there's not much boilerplate, it means we're working with the good stuff: simple functions and data.
Backbone was replaced with React, so I went on an API design journey to build a React but for communication systems. The central premise I built on was, "What should the user be looking at right now?" Based on external state we'd have a function returning messaging as data:
function render(state) {
  if (state.userHasUpdatedCompanyOwnership) {
    return messaging({ status: 'finished' })
  }
  return messaging({
    status: 'needs_to_update',
    email: {
      key: 'reminder',
      subject: '[Action required] Update company ownership',
      // ...
    },
  })
}
This is a simple example. We are trying to have a user update their account info. If they have done that, we have nothing to show them. If they haven't, they should get an email.
The key is required and used for idempotency. Sending the email will be recorded in the database and subsequent emails matching that key won't be sent. This is similar to React's DOM diffing, applying only what has changed by comparing the old and new render tree.
The status is also required but is more of a convenience label for observability into the campaign. We always want to track the progression of campaigns, and use the status to aggregate users into cohorts.
But render is totally user-defined. You can call it a million times and it will do nothing but return plain data. This is perfect for tests and leaves the actual messaging to the framework, which provided two methods:
applyCommunications({campaignType, userId, messaging}) to diff against the current state and trigger emails and other communications if necessary.
getCommunications({campaignType, userId}) to receive the trail of sent messages. This could be used to inform the external state passed to render.
The campaignType is a way of isolating campaigns from one another.
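To make the diffing concrete, here's a minimal in-memory sketch of those two methods. It is not the real implementation: the sentLog array stands in for the database, deliver is a made-up hook where an email or SMS provider would be called, the 'company_ownership' campaignType is invented for the example, and I'm assuming keys are scoped per campaign, user, and channel.

```javascript
// Hypothetical in-memory stand-in for the database table of sent messages.
const sentLog = []

// Channels this sketch knows about; each slot holds at most one message.
const CHANNELS = ['email', 'text_message']

// Assumed delivery hook; a real system would call an email or SMS provider.
function deliver(channel, userId, message) {
  console.log(`[${channel}] to ${userId}: ${message.key}`)
}

// Return the trail of messages already recorded for this user and campaign.
function getCommunications({ campaignType, userId }) {
  return sentLog.filter(
    (entry) => entry.campaignType === campaignType && entry.userId === userId
  )
}

// Diff the rendered messaging against the trail and send only unseen keys.
function applyCommunications({ campaignType, userId, messaging }) {
  const alreadySent = new Set(
    getCommunications({ campaignType, userId }).map(
      (entry) => `${entry.channel}:${entry.key}`
    )
  )
  const newlySent = []
  for (const channel of CHANNELS) {
    const message = messaging[channel]
    if (!message) continue // nothing to show on this channel
    if (alreadySent.has(`${channel}:${message.key}`)) continue // idempotent
    deliver(channel, userId, message)
    const entry = {
      campaignType,
      userId,
      channel,
      key: message.key,
      status: messaging.status,
      sentAt: new Date(),
    }
    sentLog.push(entry)
    newlySent.push(entry)
  }
  return newlySent
}

// Usage: the second call is a no-op because 'reminder' is already recorded.
const rendered = {
  status: 'needs_to_update',
  email: { key: 'reminder', subject: '[Action required] Update company ownership' },
}
const target = { campaignType: 'company_ownership', userId: 'user_123' }
applyCommunications({ ...target, messaging: rendered }) // delivers the email
applyCommunications({ ...target, messaging: rendered }) // sends nothing
```

Because the trail is filtered by campaignType, the same key under a different campaign would still send, which is the isolation described above.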
Using this framework has many benefits:
Testing is easy and fast. The render function contains just business logic and returns data. So tests didn't need to verify what would happen end-to-end and were simple to construct and decoupled from database and network reads/writes. Developers could trust the underlying framework to do the right thing and not re-test those internals.
Supporting multiple channels is easy. In the example above, we have an email: slot. The framework's design easily accommodates other channels, like text messaging:
messaging({
  status: 'urgently_needs_to_update',
  text_message: {
    key: 'urgent_alert',
    text: 'Yo, like seriously, you gotta update your account.',
  },
})
One cool limitation in this particular interface, which I love, is that you can't send two emails about a campaign at once. By having only a single email: slot and a single text_message: slot, we have guarded against the bad practice of sending conflicting or too many messages at each step in a campaign.
Campaigns are described in one place. The render function tells the whole story of what contributes to what a user is seeing. The calculation of state could be super complicated but the communications stay simple.
Functions provide the composition. Helpers and utilities built around plain functions and data are the easiest to write. We can build an ecosystem of reusable components. To break up a complex system, we break it into more render functions.
Good baseline observability. The status and the paper trail of messages sent by a campaign to a user allow us to provide tooling and dashboards for free, showing where a campaign is and detailed breakdowns for individual users. Each campaign used to require manual work to track itself, but now all someone needs is the campaignType to start digging.
This is fundamentally what was built, but it's worth calling out some complexities. There are things to consider beyond what's described here:
When is render called? A framework like React controls the state and provides setState so it can see when it changes. In a backend you could build something to watch the database, I suppose. In practice we leverage events to trigger render, or if the passage of time factors in, trigger render via a cron job. We encapsulated render and applying the messages together to prevent mistakes of calling the wrong render for a campaign while still keeping it testable in isolation.
In general, being agnostic here allows campaign owners to expend the exact amount of compute they feel necessary to run a good campaign.
State is more permanent. In a front-end framework, if you screw up in your render method, the fix is to refresh the page. For messaging, state is much more long lived and for most messaging channels you can't scrape those messages out of people's inboxes once sent.
That makes testing, and keeping regular changes safe and easy, even more important to a successful project. Prioritizing observability, metrics, detectors, kill switches, etc. is prudent for safe operation. Luckily, much of this can be provided by the framework and common tooling.
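Going back to the question of when render is called, here's a rough sketch of encapsulating render together with applying the messages. None of these names come from the real framework: defineCampaign, loadState, onAccountUpdated, and onSchedule are all made up for illustration, and applyCommunications is passed in as a stub that just records its calls.

```javascript
// Hypothetical wrapper: bundle a campaign's render with its campaignType so a
// trigger cannot pair the campaign with the wrong render function.
function defineCampaign({ campaignType, loadState, render, applyCommunications }) {
  return {
    campaignType,
    render, // still exposed so tests can call it in isolation
    run(userId) {
      // A real loader would likely be asynchronous; kept synchronous for brevity.
      const state = loadState(userId)
      const messaging = render(state)
      return applyCommunications({ campaignType, userId, messaging })
    },
  }
}

// Stand-ins for the sketch: made-up state and a stub that records calls.
const applied = []
const campaign = defineCampaign({
  campaignType: 'company_ownership',
  loadState: (userId) => ({ userHasUpdatedCompanyOwnership: userId === 'user_done' }),
  render: (state) =>
    state.userHasUpdatedCompanyOwnership
      ? { status: 'finished' }
      : { status: 'needs_to_update', email: { key: 'reminder' } },
  applyCommunications: (args) => {
    applied.push(args)
    return args
  },
})

// Event trigger: re-render one user when something relevant happens.
function onAccountUpdated(event) {
  return campaign.run(event.userId)
}

// Time trigger: a cron job re-renders every user still in the campaign.
function onSchedule(userIds) {
  return userIds.map((userId) => campaign.run(userId))
}

onAccountUpdated({ userId: 'user_done' })
onSchedule(['user_123', 'user_456'])
console.log(applied.map((call) => `${call.userId}: ${call.messaging.status}`))
```

Both triggers go through the same run, so how often a campaign re-renders is purely a question of how many triggers its owner wires up.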
Finally, I'll leave you with the coolest thing about this design: dark testing. We encode what the user should be seeing right now, but oftentimes a messaging campaign's greatest enemy is assuming we actually modeled our system correctly.
We have a stateless render at our disposal, so we can have tons of fun simulating campaigns before they happen, against production data. We run them in the dark against users, without showing or sending users any messaging, to gain confidence in how production will behave.
You can build this simulation tooling in many ways, but when all you have is engineers stringing complex messaging flows together by hand, no one is going to build this out as they go. Designed this way, it's simply yet another feature of the framework and its tooling.
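As one illustration of how small a dark run can be, here's a sketch that renders every user's state and tallies the results without applying anything. simulateCampaign and the sample states are invented for this example. A real run would read state from production data, and a fuller version could also diff against the trail of sent messages.

```javascript
// Hypothetical dry run: render every user's state and tally the results
// without calling applyCommunications, so nothing is sent or recorded.
function simulateCampaign({ render, statesByUserId }) {
  const countsByStatus = {}
  const wouldSend = []
  for (const [userId, state] of Object.entries(statesByUserId)) {
    const messaging = render(state)
    countsByStatus[messaging.status] = (countsByStatus[messaging.status] || 0) + 1
    for (const channel of ['email', 'text_message']) {
      const message = messaging[channel]
      if (message) wouldSend.push({ userId, channel, key: message.key })
    }
  }
  return { countsByStatus, wouldSend }
}

// Made-up sample; a real run would load these states from production data.
const report = simulateCampaign({
  render: (state) =>
    state.userHasUpdatedCompanyOwnership
      ? { status: 'finished' }
      : { status: 'needs_to_update', email: { key: 'reminder' } },
  statesByUserId: {
    user_1: { userHasUpdatedCompanyOwnership: true },
    user_2: { userHasUpdatedCompanyOwnership: false },
    user_3: { userHasUpdatedCompanyOwnership: false },
  },
})
console.log(report.countsByStatus) // { finished: 1, needs_to_update: 2 }
```

If the cohort sizes look wrong here, the model of the system is wrong, and we found out before anyone's inbox did.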
I get really excited about this style of encoding messaging systems. They can be super expressive and easy to test, enabling good separation of concerns and some sick tooling. I haven't come across this elsewhere since I built it up in 2019, so I wanted to share!