Code migrations

Stop running and push back against fan-out code migrations. Do effective self-run migrations instead.

As someone who works on a platform team at a large tech company, a week ago I realized something. "Completing company-wide code migrations without involving other teams" is my love language.

Involving other teams

This love language is a developed taste from learned experience. Years ago, I ran a code migration going the other way: there was a lot of code at the company to move to a new framework, so I cut Jira tickets for teams, documented and tracked overall progress, and wore the hat of a project manager coordinating impacted teams through the process.

At the climax of a mind virus

My migration was one of many code migrations that the company undertook that year. I think there were more than a dozen migrations of that size and much, much larger that we placed on the plates of basically every engineering team in the company. There was internal momentum that this style of migration was a trifecta of how to operate:

  • Individuals were rewarded in performance reviews for cross-company impact. To be seen as someone who could work across the company with many teams on a problem was more important to one's career than simply doing the work. This dynamic had been building slowly over time as management rarely asked whether projects could be done better with less coordination and recognized the most visible, interconnected work.
  • Teams were conditioned throughout this gauntlet of migrations to not push back. In the mix of migrations were legitimate company priorities and non-priorities that were coming too fast to decipher. Each team's work was so upended by these migrations that teams began to plan around them. In a healthy state, teams would simply push back and that'd be the end of that.
  • Teams, particularly platform and infrastructure teams, had been stifled for quite some time due to lack of direct staffing and investment. It had been common for software migrations to take years or never be completed. In lieu of clear focus and with a hazy migration green light, there were many projects for which many teams felt the time was now or never to complete given the momentum around code migrations.

It's easier with years of retrospect to be able to identify and understand how a mind virus came to be, perhaps even easier when you as someone managing one of these projects yourself were both a victim (of other people's migrations) and a driver/exemplar. I was on both sides of the coin, most often simultaneously.

I was promoted in no small part due to my handling of the migration I shepherded. The migration completed in a relatively short period of time. The code migration took my team from one which seemed perpetually dead-locked in code maintenance to positive feature velocity. What wasn't to love about this way of going about code migrations?

Ultimately, the main thing that I regret is that I made the lives of the people around me and even people I didn't know at the company objectively and subjectively worse. The whirlwind of migrations and deadlines facing every team led a lot of people to feel bad and hate their jobs. Heated conversations were commonplace and many weeks of people's lives were lost not only on the work but in the coordination overhead as well.

There's a lot a business can weather. I am not a business. These terrible experiences and emotions are what I remember about that period of time.

For a business, these problems do eventually come to roost. The cultural effects of people feeling helpless, stressed, and disengaged make waves which do harm the business. While I feel the business did correct for this virus over time, it was not a permanently learned lesson for the company like it was for me.

Downsides of fan-out migrations

For completeness, the downsides of running code migrations in the shape of "fan out work to teams, coordinate them to do it" are:

Attrition and negative cultural effects

As stated above, the psychological problems individuals faced led to attrition of top talent and reinforced a negative culture around how work is done.

Now, I recognize that what was happening at this time might be considered an extreme case of code migrations gone wrong. I would instead argue: Rome doesn't fall in a day. All it takes is a loose pattern to form and implicit incentives to align to create such a mind virus.

Was it an extreme situation in the sense of being an industry-wide outlier? No.

Was it extreme because fan-out code migrations were the local-maximum worst they got? Yes, and we call that the climax. You build towards a climax.

Significant time lost to communication overhead

What could have been completed by a handful of engineers became a handful of engineers loosely shepherding hundreds of engineers.

Significant time lost to discussions around ownership

"Code ownership" in the context of fanned-out code migrations answered the question, "for this piece of code, who will do the work to change it?" This was already an expectation of what being a code owner meant before the migration spree; however, the spree was the first to put that responsibility to the test.

Ownership going into the spree was pretty loose because the cost of ownership was low. With a gauntlet of migrations coming from across the company, competing with each team's actual work, naturally what had been a minor consideration became massive. Tactically, because migrations might take whole engineering months away from your team, weeks or months of bickering about ownership were worth it if they succeeded in deferring ownership to someone else: it was a net win for the team.

And so that's exactly what we saw: months of arguing over ownership of work that would only take days. You knew taking ownership of some code on one migration's battlefield meant losing battles for upcoming migrations where the changes might in fact take months, not the days being argued about in the present moment. Everyone wanted ownership decided in their favor.

Additionally, the ownership discussions were doubly unfruitful and difficult because none of the teams involved in them cared about the work. The migrations had no value to them. It's one thing to have heated discussions about important things, but it's a whole other level of toxic to have heated discussions about things which are not important to any of the involved parties.

(As an aside, I don't care for "ownership" of code because it is a backstop only for the mediocre people you should have let go anyways. I much prefer stewardship: identifying the people who care the most, have the most context, and have a vision of a piece of software. I understand that's a tough needle of nuance to thread in larger organizations. But life's tough, figure it out because it's important.)

Tons of incidents

Individual teams owning their small slice of a code migration meant that they would also own the incidents that resulted from the changes. For each migration we saw multiple teams have incidents. My code migration I believe resulted in about five or six incidents for various teams.

Incidents are bad, but what was uniquely bad about these incidents:

  • Incident learnings traveled slowly. Most incidents were repeats. One team would have an issue and by the time they had gotten a handle on their situation another team's changes were tipping over. The code migration work was distributed faster than we could learn in real-time how to do the migrations better.
  • Learnings often didn't travel to the code migration team soon enough for them to help. Shepherding a code migration is a 9-5 job, but incidents are 24/7 affairs.
  • Once "enough" incidents happened some migrations were downright paused, reworked centrally and then rebooted and re-assigned. One particular migration spent the whole calendar year in a start/stop pattern before it was even 20% complete. These incidents magnified communication overhead as one migration was actually a series of coordinated migration attempts. This made planning a useless exercise.

The incidents also revealed the incorrectness of one of the primary reasons people naively bias towards fanning out to teams: the team shepherding the code migration doesn't know other teams' code intimately enough to change it efficiently.

These incidents presented the real truth: no one knew the code intimately enough to change it efficiently. Teams "own" code but the reality is that the tenure of an individual at the company is at best a few years. The situated context needs to be constructed anew no matter who is doing the work.

Individual teams would often try to page in the code migration shepherds to help with an incident. More often than not, the code migration shepherds had more context because they at least knew half the problem: their own bits that needed changing. That was more than a team with no particular knowledge of its own code had.

Individuals are punished

Many individuals were dinged in performance review for not hitting their own team goals. A classic tale as old as time: the company, through indecisiveness, made an attempt to have its cake and eat it too:

  • Let the engineering organization run wild upending itself with code migrations.
  • Still hold teams accountable to their goals as if those migrations were not going on and were not valuable.

Now of course this is poor leadership. Setting clear priorities and deciding what does and does not need to happen at a company level can only be done by leaders. Putting that aside, let's talk about how fan-out code migrations are a poor fit for a company that does performance reviews.

Performance reviews are conducted on the basis of:

  • An individual's performance as measured against their peers.
  • The approximate measure of performance is impact: what did the individual do to push the business forward, how optimal was that push.
  • At anything other than a small company, "impact" of an individual is incredibly difficult to assess in terms of business value.
  • In effect, what is valued is mostly weighted by the other factor: against their peers. How one "stacks up" to how others in their immediate team and others company-wide in the same role create "impact."
  • Lacking good measure, the things reviews can look at are: novelty and consistency. More precisely, the enduring measure of performance is "consistent novelty."

Fan-out code migrations are the antithesis of consistent novelty for software engineers. Hundreds of engineers are doing the exact same work as you are. Since everyone does it, it's considered brain-dead gameplay (even when it might not be in any given situation) by the time performance reviews come around. At best your assessment for that stretch of work will be a nil-case "doesn't hurt, doesn't help" and "low risk, low reward." For individuals trying to get promoted the work is a temporary death sentence for their career.

So for individuals there's no incentive to do the work, certainly not do it well, and the poor individuals left holding the bag are penalized for it.

Do the code migration yourself

You or your team need to make a company-wide change. The correct approach is to do it yourselves.

Woah, I just said "correct" but every good answer is "it depends." Silly me, let me clarify into something more of a universal truth: directionally the code migrations a company carries out should progress towards self-run, non-fanout migrations to ensure:

  • success of those projects (displacement),
  • a positive development velocity (velocity),
  • and healthy engineering culture (acceleration).

And some code migrations don't need to happen.

Fan-outs are side steps

Fan-out migrations are most appealing because they are a means to skirt staffing issues. Your team has something really important (in your eyes) but leadership, implicitly or explicitly, disagrees and does not staff the effort. Fan-out migrations are a means to sneak around that disagreement. Resolving the disagreement is a lot of work and, lacking psychological safety and good leaders, outright dangerous work to embark on to get that proper funding.

Most people in those situations would reasonably decide to pursue other options, because the work isn't actually so important to stake themselves on. (Hint: then it's not important.)

So without changes to staffing and the work still to do, the next problem will be that your estimates of the work show that if your team does the work it will take until the end of time. With only your team's man-power competing with the growth of the code you'll never catch up. You need man-power and it has to come from somewhere. And thus the fan-out migration idea is formed.

If there's no prior art at your current company, there's art from your past or other companies. You do a trial balloon: you try the fan-out migration, most teams don't do the work, but a sliver do for whatever reason, and leadership either doesn't care or doesn't catch wind of it. This is now a strategy you and others can employ to skirt staffing issues. If not stomped out, these organization hacks will eventually snowball into the situation described above.

The important bits here are:

  • When you pursue a fan-out migration, leadership is not on your side in the right way.
  • Anyone advocating a fan-out migration is either being disingenuous about the situation and their estimates, or lacks the expertise to do better.

Being disingenuous is complicated: it is often a mixture of operating in a bubble and subconscious thinking that distorts urgency and importance. You need to press yourself really hard to justify this work to the business, and first to yourself. If you can't justify it to yourself and make it a priority for yourself, how can you expect others to work on it?

My team was asked to do a fairly large migration a while ago. I let them know this work wasn't important to us so we wouldn't be prioritizing it. They said because they were an understaffed team they couldn't do the work. I asked, "Well if this is important enough to ask us, this must be a top priority for you?" They quickly rebutted, "No, no, no. We've got much more important things to do: X, Y, Z..." Long story short, the solution was not to fan out work no one cared about. It was to follow the need for the migration back to its source and ask the team that would benefit most to put in the work, not everyone else.

If the work isn't important to you, it's even less important to anyone else.

Okay, now you've convinced yourself that this work is important to the company. Let's approach it from a clean slate and see what we can do with a fan-out migration off the table.

Define the goal

Like any project, the most important thing to figure out is the goal. There's not too much in particular to recommend here given how open-ended your goals might be. You definitely don't want to start at "we need a code migration" and work backwards.

A rough goal is fine to proceed into the next few sections because we can shape it a bit more by placing emphasis on data-driven, output-oriented decisions.

Gather the data

Every migration has a before and an after. We'll want to measure progress and know when the goal is reached.

Data we want for an initial assessment:

  • an approximate count of how many things need to change,
  • the approximate "utilization" of each of those things,
  • an understanding of the derivatives of these things: the velocity and acceleration of number-of-things and their utilization.

Utilization is a measure of importance. "If a tree falls in the woods and no one is around to hear it, does it make a sound?" You might have thousands of pieces of code to upgrade, but if they are dead code already it doesn't quite matter what you do.

Some concrete examples of count and utilization:

  • Number of email types in a codebase,
  • Number of emails sent to customers per type.

Assuming email types are distinctly defined in code, we can use source control history and greps to compute current count as well as velocity and acceleration of growth in email types.
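Here's a minimal sketch of that arithmetic, assuming you've already counted email types at a few evenly spaced points in history. The counts and the quarterly sampling are made up for illustration, and `differences` is a hypothetical helper name.

```python
def differences(values):
    """Return the change between each consecutive pair of values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical number of email types at the end of each quarter, e.g. from
# grepping for type definitions at one commit per quarter.
counts = [40, 46, 55, 67, 82]

velocity = differences(counts)        # new email types per quarter
acceleration = differences(velocity)  # change in that rate per quarter

print(velocity)      # [6, 9, 12, 15]
print(acceleration)  # [3, 3, 3]
```

Positive numbers in both lists mean the problem is growing and growing faster, which is the case where waiting it out is not an option.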

For the number of emails sent, we will need to consult databases, logs, and/or metrics to understand which email types are sent and what their trends are.

With this data in hand, we can start to talk about strategies.

Decide a strategy

For completeness: after looking at the data you might find you have negative velocity and acceleration on your hands. In that case the migration might be simple. Time will eliminate the problem or at worst make it more manageable, and if you can wait you may have to do much less than you anticipated to kill it off.

Let's now assume non-negative growth or that you can't wait. Assume X (e.g. an email type) is the subject of the code migration (often the concept), A is the current solution, B is the new solution. Some archetypes:

  • Total migration: 100% of A is now B
  • Partial migration: 50/90/95% of A is now B
  • Ratchet migration: All new Xs are Bs instead of As, alternatively "no more As"

In terms of difficulty and cost: ratchet < partial < total. In terms of value, well that's where things get more complicated.
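To give a feel for why the ratchet is the cheapest of the three, here's a sketch of one as a single check, say in CI. It assumes you can already count the remaining As (a grep would do) and that you store the last accepted count somewhere; `ratchet_check` and the numbers are hypothetical.

```python
def ratchet_check(baseline, current):
    """Return (ok, new_baseline) for a count of remaining As.

    The count may stay flat or shrink, never grow. When it shrinks, the
    baseline tightens so the improvement cannot be undone later.
    """
    if current > baseline:
        return False, baseline
    return True, min(baseline, current)


print(ratchet_check(120, 121))  # (False, 120): someone added a new A
print(ratchet_check(120, 118))  # (True, 118): two As removed, baseline drops
```

No other team has to do anything here except not add new As, and the check tells them when they have.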

At a company where the number of X is doubling every year, simply doing a ratchet to prevent new As will achieve a partial migration of 50% to Bs in just a year's time. This is the power of the "golden path": by making the easiest thing the best thing, time will be your greatest ally in enacting change.
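The 50% figure falls out of simple arithmetic. A small sketch, assuming the total number of Xs multiplies by a constant factor each year, every new X is a B, and no existing A is touched; `share_migrated` is a hypothetical name.

```python
def share_migrated(growth_factor, years):
    """Fraction of all Xs that are Bs after `years` of a ratchet.

    Assumes the total number of Xs multiplies by `growth_factor` each year,
    every new X is a B, and no existing A is migrated or deleted.
    """
    return 1 - 1 / growth_factor ** years


for years in range(1, 4):
    print(years, share_migrated(2, years))
# 1 0.5
# 2 0.75
# 3 0.875
```

With slower growth the ratchet still works, it just takes longer, which is exactly what the velocity and acceleration data tells you up front.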

Partial migrations also have significant benefits when we factor in utilization. It's very rare to have a migration that when measured by utilization doesn't follow a Pareto distribution aka the 80-20 rule: "80% of outcomes are due to 20% of causes." There will be big fish to fry and the rest have increasingly diminishing returns.

Utilization is the more appropriate thing to strategize on because it is output-based. We don't care how many email types we have, we care that, for example, "emails are sent to the correct users" 99% of the time.
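As a sketch of strategizing on utilization, the following picks the most-used email types until they cover a target share of all sends. The type names, send counts, and the 80% target are invented for illustration, and `types_to_migrate` is a hypothetical helper.

```python
def types_to_migrate(sends_by_type, target_share):
    """Pick the most-used types until they cover `target_share` of all sends.

    Returns the chosen type names and the share of sends they cover.
    """
    total = sum(sends_by_type.values())
    if total == 0:
        return [], 0.0
    chosen, covered = [], 0
    by_usage = sorted(sends_by_type.items(), key=lambda kv: kv[1], reverse=True)
    for name, sends in by_usage:
        if covered >= target_share * total:
            break
        chosen.append(name)
        covered += sends
    return chosen, covered / total


# Hypothetical emails sent per type over some period.
sends = {
    "receipt": 700_000,
    "password_reset": 200_000,
    "weekly_digest": 60_000,
    "invite": 30_000,
    "legacy_promo": 10_000,
}
chosen, share = types_to_migrate(sends, 0.8)
print(chosen, share)  # ['receipt', 'password_reset'] 0.9
```

In this made-up data, migrating two of five types covers 90% of what customers actually receive. The remaining three are where the diminishing returns live.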

This is where we see with data-driven, output-based decisions that total migrations have little room to stand. Achieving 100% has two merits:

  • Setting a new baseline (e.g. "all emails must be localized")
  • Simplifying complexity (e.g. "we only need to reason about a single mailing system")

Baselines are awesome ideals, but diminishing returns on the utilization front make the ratchet and partial approaches much more appealing. After all, customers don't complain loudly about the unmigrated 1% case that they rarely see.

Complexity can be a valid concern, but its reduction is often better fought for on different battlefields: those of security, reliability, and storage and compute cost. Achieving 100% isn't about the migration, it's about the inherent cost of keeping something around. Code's pretty cheap but data breaches, incidents, and servers can be quite costly. Those will be much more palatable reasons to kill something completely.

Total migrations are always a terrible idea from a project's outset. If you've decided 100% is the appropriate strategy, you're probably wrong and setting the project up for failure. Almost assuredly, the correct place to stop a code migration for diminishing returns is before 100%. If the ratchet and partial migrations work and a new goal presents itself to finally cut the remaining bits: great. Just don't jump into total migrations to start.

Upsides of self-run migrations

Let's contrast the downsides of running code migrations in the shape of "fan out work to teams, coordinate them to do it" to self-run migrations.

Attrition and negative cultural effects

Self-run migrations inherently don't have this problem. For every other team, the migrations are all win. Someone is coming in and making improvements. Sounds lovely.

We can't escape the fact that there can be poor execution here, but it's isolated and easier to hold individuals accountable. There's no magic fix in doing self-run migrations for culture if the general sense is that poor execution is the norm. Self-run migrations will be more successful in general and can boost morale slightly, but engineering cultures place heavy emphasis on the negatives. If your goal is to improve culture alone, you'll want to chase the most negative negatives instead (unless code migrations are one).

Significant time lost to communication overhead

A handful of engineers > handful of engineers loosely shepherding hundreds of engineers.

Significant time lost to discussions around ownership

The main issue, disruption of team goals, is not present here. The people doing the work are the people who care about the work.

Nothing is taken away from teams due to this work so they have no reason to be defensive, only rightfully cautious of changes to their systems in general. In fact teams are even receptive to accepting ownership of slightly more things if the migrating team does the work and provides good documentation/support for those bits going forward.

Tons of incidents

There will be incidents. However we know the migration will be centrally managed. There will be multiple small incremental changes. There will be consistent and centrally managed rollback mechanisms. The issues while making changes will be learned immediately by the central team doing the rollout.
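To make "small incremental changes with a central rollback" concrete, here's a rough sketch rather than a real rollout tool. The `apply`, `healthy`, and `roll_back` callbacks, the service names, and the batch size are all assumptions standing in for whatever your change and your health signal actually are.

```python
def roll_out(targets, apply, healthy, roll_back, batch_size=2):
    """Apply a change in small batches, undoing a batch if it looks unhealthy.

    Returns (migrated, failed_batch). `failed_batch` is empty on full success.
    """
    migrated = []
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        for target in batch:
            apply(target)
        if not all(healthy(target) for target in batch):
            # Undo in reverse order, then stop so the central team can
            # learn from the failure before touching anything else.
            for target in reversed(batch):
                roll_back(target)
            return migrated, batch
        migrated.extend(batch)
    return migrated, []


# Hypothetical in-memory stand-in for four services; "svc-c" breaks on change.
state = {}
result = roll_out(
    ["svc-a", "svc-b", "svc-c", "svc-d"],
    apply=lambda t: state.__setitem__(t, "B"),
    healthy=lambda t: t != "svc-c",
    roll_back=lambda t: state.pop(t, None),
)
print(result)  # (['svc-a', 'svc-b'], ['svc-c', 'svc-d'])
print(state)   # {'svc-a': 'B', 'svc-b': 'B'}
```

The point is less the loop than who runs it: one team sees every failure, in order, and decides what to change before the next batch.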

This is a much better situation than the dozens of independent incidents, tragic repeat issues, and churn of start/stops. As the migration plan evolves and is reworked due to incidents or new knowledge, the central team will be the only one adapting.

Individuals are punished

Self-run code migrations are the epitome of "consistent novelty." A small number of individuals are making a huge impact, not simply coordinating one. Here, we don't wrongly promote coordination or busywork. Most importantly, no one is given work that's not valuable, that takes away from the work they should be doing, and that won't have an impact.

Summary

  • Never do fan-out code migrations. They suck on every dimension.
  • Fan-out migrations are disingenuous. How can someone expect others to do work they don't feel is important enough to do themselves?
  • Never plan total migrations from the outset. Achieving 100% is actually two separate goals and ought to be two separate projects: (1) capturing sufficient utilization to address the majority of an outcome-based problem, then (2) mitigating a cost or risk.
  • To prevent fan-out migrations from wreaking havoc on your engineering organization:
    • As a leader, promote work done, not work coordinated. Don't applaud fan-out migrations as they are process failures.
    • As a team receiving work, push back for your and your company's sake. It's often as simple as saying that you won't be doing it.
    • As a team needing to get a lot of work done, put in the work to make a case for getting proper support from the company, find a goal you can hit with what you have, or let it go. Fanning out work can't be an option on the table.

I did not talk about concrete engineering approaches in this article. We can't start to talk about approaches without a strong understanding of what to avoid. Having made the case for self-run migrations, we can next discuss efficient change management. That's the next problem, for another day.