
System design interviews

The signals I try to gather as an interviewer and how I try to touch on those signals as a candidate.

For the past couple of years I've worked on Stripe's engineering interview committee. Our remit is the design, scoring, and curation of technical interview questions for engineering individual contributors (ICs) and engineering managers (EMs). My areas of ownership/responsibility are:

  • Some of our internal tooling for engineers participating in the hiring process. All engineering ICs beyond L1/entry-level participate in the hiring process for other engineers: we conduct interviews and score candidates using rubrics designed by the committee and volunteers. We build and maintain tools that help engineers spin up on new interviews and see which interviews they are trained up for.

  • I also steward the system design interviews Stripe uses. When I interview candidates this is my exclusive focus, and I really enjoy it. On the committee I've facilitated rubric changes, and I advocate for, explain, and iterate on the signals we try to capture with our system design questions.

This article explains:

  • The signals I've found in practice to demonstrate a firm understanding of system design, and
  • How capable candidates can hit on these signals by being more deliberate.

This is not an article on how to hack Stripe's interview process. Fortunately, the rubric puts its emphasis on practical experience and well-reasoned decisions, not specific design patterns, algorithms, or buzzwords. We poke, prod, and push candidates on their decisions: it's difficult to rehearse for or fake such a dynamic interview.

Valuable signals

The core components I value in any design are:

Simplicity

No component of the system should lack a compelling purpose. The purpose needs to be more compelling than the added complexity it brings. Carelessly added or unjustified components often take the shape of complex data pipelines, caches, or unnecessary redundancy.

Unfamiliar technologies are perfectly alright, but they must be presented in terms of "first principles." For example, someone may propose a special/custom database for the presented problem, but they must not paper over the essential truths (correctness, latency, atomicity, the CAP theorem, time- and space-complexities of data structures, etc.) under the guise of "it works" or "it's magic."

Solving the right problem

Alternatively put: make the design fit the requirements well. There are the basic functional requirements, then the non-functional ones, and beyond those the organizational concerns; as you scale, you end up trading off multiple competing goals against each other.

Sometimes the goals are obvious, other times you should lean on your interviewer to be the oracle of user needs and organizational context.

Presenting a design

Okay, so how can a capable candidate present their design to maximally hit on these signals?

Simplicity

Software engineering interviews have a bad reputation: puzzles, algorithm questions, and trick questions. Most candidates' instincts are to demonstrate a vocabulary, not an in-depth understanding of problem solving. This often leads to candidates rattling off, right out of the gate: cache, queue, load balancer, shards, consistent hashing, and cache again. This will lose you a lot of points when one of the measures is simplicity.

Instead, I always tell candidates upfront that they should start with the simplest thing possible. Understand the functional requirements and hit them. This should be really quick. (If you are asked to diagram a system, draw this out to start with.) Demonstrating you understand what can be the simplest thing possible speaks volumes.

This style is advantageous because complexity is more palatable when it is justified. With a basic structure in place you can now move from merely solving the problem to solving it well. You can now always make a trade-off decision: to add something else you must trade it off against simplicity, which should be hard to do because you should value simplicity. This sets you up to tell a good story.

Solving the right problem

Oftentimes in larger engineering organizations we tend to form groups of more specialized roles, going from generalist to 'front-end, back-end' to 'product, platform, infra' and so on. This layering is a means of organizing, and to some degree making productive, people who often aren't presently capable of being those do-it-all generalists. (And of course some individual specialists in the field are truly, if temporarily, irreplaceable, but any engineering organization ought to actively mitigate this as a risk to the business.) The really strong engineers don't live in any single one of these bubbles; they must transcend them to some degree to make really successful systems.

This is something I've come to learn really well working on platform teams for more than half a decade. Most of the time, achieving a successful design is not about solving for your immediate internal stakeholders. Instead, it is about extracting the essence of what the very best internal teams are doing for external customers and commoditizing it for the rest of the organization to leverage. You must make internal stakeholders happy, but they should only be happy doing the work that will make the end user happy. It's an interesting game of guard rails and incentives.

That is one example; another is designing for your future self, or for the future team that will be tasked with making changes.

Through this multi-faceted lens you can ask what people are trying to do and why and how your system can help (if it should).

Now, this is all rather meta, and in an interview you'll have far too little time to articulate a narrative such as this. What we can do tactically is:

  • Narrow scope. Cut, cut, cut anything not essential to the tasks needed. Do not add any additional features or nice to haves. You will not have time to do them justice.
  • Call out what the system won't be for, and provide an example of a use case that won't be served.
  • Enumerate the stakeholders. You don’t have to perfectly embody each one. State that there are more stakeholders than might meet the eye.

Approach

With a simple design for a well-defined problem in hand, how you then solve it well is up to you. Presenting a philosophy and a progression is a great way to demonstrate this isn't your first rodeo. Here's mine: what I solve for, and in what order:

  1. Correctness
  2. Error handling
  3. Reliability
  4. Scale
  5. Maintenance

Let's map each of these out in a little more detail.

Note that it is very unlikely I could actually get through all of these considerations in a design interview due to time constraints (Stripe's system design interview is only 45 minutes), but laying out the process and not finishing due to time still has value. Lay out a framework and follow it loosely, highlighting some considerations along the way, so the interviewer can see that you have breadth (because you have a framework) as well as depth (because they get concrete examples to cite).

Correctness

  • Agree on terminology with the interviewer. For example, what do you mean when you say “cache”? Is it a distributed or in-memory cache? Does it have a TTL? If you are drawing diagrams, have a shared understanding of what the boxes and lines represent. (As an interviewer, I often assist scattered candidates by restating what I believe they are saying.)
  • Focus on simplicity. This is the stage where you are making the simplest thing possible.
  • Model domain complexity precisely. Candidates often try to over-abstract or expand the scope of a system too early and design a floor wax that is also a dessert topping. Do not get sidetracked solving tangential problems.
  • Demonstrate end to end understanding of critical flows. Prove that you have the simplest design that meets the requirements by walking through the important data flows step by step.

Error handling

  • How do you handle any logic errors?
  • How do you handle network errors?
    • Any component may be permanently unavailable and you can’t keep adding more and more network components to prevent this.
    • Do you retry? And how? Some common ideas (see the sketch after this list):
      • Solve thundering herd problems using jitter.
      • Solve retry blasts by giving up on some requests and returning "DO NOT RETRY" to prevent more retries.
  • How will errors be surfaced in the system’s API? Which things are exceptional?
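
To make the retry ideas concrete, here is a minimal Python sketch of capped exponential backoff with full jitter that also honors an explicit do-not-retry signal. The names (send_request, RETRYABLE_ERRORS, the response's ok/retryable attributes) are hypothetical placeholders, not tied to any particular library:

```python
import random
import time

# Hypothetical placeholders: swap in your transport's real error types.
RETRYABLE_ERRORS = (TimeoutError, ConnectionError)

def call_with_retries(send_request, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a request with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = send_request()
        except RETRYABLE_ERRORS:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
        else:
            # An explicit "do not retry" signal from the server stops retry blasts.
            if getattr(response, "retryable", True) is False:
                return response
            if response.ok or attempt == max_attempts:
                return response
        # Full jitter: sleep a random amount up to the capped exponential backoff,
        # so synchronized clients don't create a thundering herd.
        backoff = min(max_delay, base_delay * (2 ** (attempt - 1)))
        time.sleep(random.uniform(0, backoff))
```

The point is not this exact code; it is being able to explain why the jitter and the explicit stop signal exist.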

Reliability

  • Eliminate Single Points of Failure (SPOFs). For each component of the system identify how it might bring the whole system down if it is unavailable and remedy it. Usual fixes:
    • Load balancing
    • Shards
    • Backups, standbys, and redundancy
  • Identify opportunities for individual components to continue working in a partially degraded state when their dependencies are unreliable. Try to minimize total failure for critical flows. This works hand in hand with eliminating SPOFs: it is reliability in depth.
  • Bias towards predictable performance over sometimes optimal performance. Usual fixes:
    • Use index hints in the database components.
    • Avoid hot shard keys by avoiding hashing solutions in favor of dedicated capacity for known exceptional cases.
    • Distribute requests with simple logic, e.g. round-robin or random strategies (see the sketch after this list).
  • Cover latency requirements and make it a part of trade-off discussions.
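
To illustrate the "simple distribution logic" and "partial degradation" points, here is a rough Python sketch of round-robin selection that skips replicas currently failing health checks, plus a read path that degrades to stale data rather than failing outright. The replica names, is_healthy callback, fetch function, and stale_cache are all hypothetical:

```python
import itertools

class ReplicaPool:
    """Round-robin over a fixed set of replicas, skipping unhealthy ones."""

    def __init__(self, replicas, is_healthy):
        self._replicas = list(replicas)    # e.g. ["db-a", "db-b", "db-c"]
        self._is_healthy = is_healthy      # hypothetical health-check callback
        self._cycle = itertools.cycle(self._replicas)

    def pick(self):
        # Try each replica at most once per call; simple and predictable.
        for _ in range(len(self._replicas)):
            replica = next(self._cycle)
            if self._is_healthy(replica):
                return replica
        return None  # every replica is currently down


def read_profile(pool, fetch, stale_cache, user_id):
    """Serve from a replica if possible; degrade to stale data instead of failing."""
    replica = pool.pick()
    if replica is not None:
        return fetch(replica, user_id)
    # Partial degradation: a stale profile beats a total failure for this flow.
    return stale_cache.get(user_id)
```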

Scaling

  • Identify which components will fall over first due to load. Discuss options for horizontal and vertical scaling. Adding more hosts might require being more stateless and/or introducing eventual consistency.
  • Consider duplicating infrastructure in multiple regions or data centers. Within each region, have multiple copies of each system component. (This is much easier now that SPOFs are dealt with.) Identify areas where batching or aggregation are appropriate (see the sketch after this list). If you are starting to introduce caches, consider how they will be populated. (Maybe pre-populate them.)
  • Now that we have mapped so much out in the previous steps, we can consider scale properly and modify requirements. Often there is no way to scale further than to drop or loosen guarantees or invariants. (Moving things async, eventual consistency, less reliability for non-critical tasks, priority tiers.)
  • With so many nodes in a distributed system, you might delve into how you would handle service discovery, access control, and cryptography (public key infrastructure (PKI), service meshes with mutual TLS (mTLS)) if your system has particular needs in these areas.
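
As one example of where batching or aggregation pays off, here is a minimal Python sketch of a write buffer that flushes either when it reaches a size threshold or when its oldest item gets too old, trading a little latency for far fewer downstream requests. The flush_batch callback (e.g. a bulk insert) is a hypothetical stand-in:

```python
import time

class WriteBatcher:
    """Buffer writes and flush them in batches to reduce downstream load."""

    def __init__(self, flush_batch, max_items=100, max_age_seconds=1.0):
        self._flush_batch = flush_batch  # hypothetical callback, e.g. a bulk insert
        self._max_items = max_items
        self._max_age = max_age_seconds
        self._buffer = []
        self._oldest = None

    def add(self, item):
        if not self._buffer:
            self._oldest = time.monotonic()
        self._buffer.append(item)
        if (len(self._buffer) >= self._max_items
                or time.monotonic() - self._oldest >= self._max_age):
            self.flush()

    def flush(self):
        if self._buffer:
            self._flush_batch(self._buffer)  # one bulk request instead of many
            self._buffer = []
            self._oldest = None
```

The size and age thresholds are exactly the kind of loosened-guarantee trade-off worth calling out explicitly in the interview.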

Maintenance

  • Describe what common changes might be made to the system. Make those changes simple.
  • Describe what we can do to ensure the system works as we expect: observability such as logging, audit trails, metrics, detectors, and alerting (see the sketch after this list).
  • Design the system for the organization it is built for. How is ownership of different aspects of the system managed? As the system evolves, who should be making the changes? How are engineers properly incentivized to keep the system healthy?
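
For the observability point, here is a small sketch (Python standard library only; the logger and operation names are made up) of wrapping a critical flow so every call emits a log line with its outcome and latency that the owning team can chart and alert on:

```python
import functools
import logging
import time

logger = logging.getLogger("payments")  # hypothetical service name

def observed(operation):
    """Log outcome and latency for a critical flow so it can be charted and alerted on."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            outcome = "error"
            try:
                result = func(*args, **kwargs)
                outcome = "success"
                return result
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                # In a real system this would also increment a metric counter.
                logger.info("op=%s outcome=%s duration_ms=%.1f",
                            operation, outcome, elapsed_ms)
        return wrapper
    return decorator

@observed("refund.create")  # hypothetical operation name
def create_refund(charge_id, amount):
    # ... business logic would live here ...
    return {"charge_id": charge_id, "amount": amount}
```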

Conclusion

System design interviews can be really awesome. As an interviewer, the interview is so much more dynamic and engaging than pass-or-fail code exercises. For candidates, these interviews can be quite stressful and daunting. Most interview processes are poorly designed and executed for really understanding someone's capabilities; it is a hard problem with time constraints, after all.

Here I have laid out my definition of the essence of an excellent, time-constrained system design:

  • Design the simplest thing first, then expand upon it.
  • Design for the right problem, stakeholders, and outcomes.
  • Present a clear narrative of how you go about designing the system.

As an interviewer, I want to come out of the interview knowing the candidate has done all of these things, so as a candidate it is to your advantage to facilitate that.

The interview process is two-sided: the employer assesses you and you assess the employer. Design interviews will be run differently elsewhere. The important meta point to take away from this article is: "Does this approach to assessing a system design as an interviewer, and this approach to designing systems as a candidate, resonate with you?"

If it does, I recommend taking some of the advice here and applying it on both sides of the table if possible. It would really make the industry a better place in my opinion.

If it doesn't, what does? Consider both sides of the table: what you would assess, and how you could best perform by your own measure. And most importantly, let me know!

Thanks for reading!

If you made it this far, feel free to get in touch with me about doing a mock system design interview when you go on the hunt for your next job; I'm happy to help.