Stop running and push back against fan-out code migrations. Do effective self-run migrations instead.
As someone who works on a platform team at a large tech company, a week ago I realized something. "Completing company-wide code migrations without involving other teams" is my love language.
This love language is an acquired taste, learned from experience. Years ago, I ran a code migration the other way: there was a lot of code at the company to move to a new framework, so I cut Jira tickets for teams, documented and tracked overall progress, and wore the hat of a project manager coordinating impacted teams through the process.
My migration was one of many code migrations that the company undertook that year. I think there were more than a dozen migrations of that size and much, much larger that we placed on the plates of basically every engineering team in the company. There was internal momentum that this style of migration was a trifecta of how to operate:
With years of hindsight it's easier to identify and understand how a mind virus came to be, perhaps even easier when, as someone managing one of these projects, you were both a victim (of other people's migrations) and a driver/exemplar. I was on both sides of the coin, most often simultaneously.
I was promoted in no small part due to my handling of the migration I shepherded. The migration completed in a relatively short period of time. The code migration took my team from one which seemed perpetually dead-locked in code maintenance to positive feature velocity. What wasn't to love about this way of going about code migrations?
Ultimately, the main thing that I regret is that I made the lives of the people around me and even people I didn't know at the company objectively and subjectively worse. The whirlwind of migrations and deadlines facing every team led a lot of people to feel bad and hate their jobs. Heated conversations were commonplace and many weeks of people's lives were lost not only on the work but in the coordination overhead as well.
There's a lot a business can weather. I am not a business. These terrible experiences and emotions are what I remember about that period of time.
For a business, these problems do eventually come home to roost. The cultural effects of people feeling helpless, stressed, and disengaged make waves which do harm the business. While I feel the business did correct for this virus over time, it was not a permanently learned lesson for the company like it was for me.
For completeness, the downsides of running code migrations in the shape of "fan out work to teams, coordinate them to do it" are:
As stated above, the psychological problems individuals faced led to attrition of top talent and enforced a negative culture around how work is done.
Now, I recognize that what was happening at this time might be considered an extreme case of code migrations gone wrong. I would instead argue: Rome doesn't fall in a day. All it takes is a loose pattern to form and implicit incentives to align to create such a mind virus.
Was it an extreme situation in the sense of being an industry-wide outlier? No.
Was it extreme because fan-out code migrations were the local-maximum worst they got? Yes, and we call that the climax. You build towards a climax.
What could have been completed by a handful of engineers became a handful of engineers loosely shepherding hundreds of engineers.
"Code ownership" in the context of fanned-out code migrations answered the question, "for this piece of code, who will do the work to change it?" Now, this was already an expectation of what being a code owner meant before the migration spree. However, the spree was the first thing to put that responsibility to the test.
Ownership going into the spree was pretty loose because the cost of ownership was low. With a gauntlet of migrations coming from across the company, competing with each team's actual work, naturally what had been a minor consideration became massive. Tactically, because migrations might take whole engineering months away from your team, weeks or months of bickering about ownership were worth it if they succeeded in deferring ownership to someone else: it was a net win for the team.
And so that's exactly what we saw: months of arguing over ownership of work that would only take days. You knew taking ownership of some code on one migration's battlefield meant losing battles for upcoming migrations where the changes might in fact take months, not the days being argued about in the present moment. Everyone wanted ownership decided in their favor.
Additionally, the ownership discussions were doubly unfruitful and difficult because none of the teams involved cared about the work. The migrations had no value to them. It's one thing to have heated discussions about important things, but it's a whole other level of toxic to have heated discussions about things which are important to none of the involved parties.
(As an aside, I don't care for "ownership" of code because it is a backstop only for the mediocre people you should have let go anyways. I much prefer stewardship: identifying the people who care the most, have the most context, and have a vision of a piece of software. I understand that's a tough needle of nuance to thread in larger organizations. But life's tough, figure it out because it's important.)
Individual teams owning their small slice of a code migration meant that they would also own the incidents that resulted from the changes. For each migration we saw multiple teams have incidents. My code migration I believe resulted in about five or six incidents for various teams.
Incidents are bad, but what was uniquely bad about these incidents:
The incidents also revealed the incorrectness of one of the primary reasons people naively bias towards fanning out to teams: the team shepherding the code migration doesn't know other teams' code intimately enough to change it efficiently.
These incidents presented the real truth: no one knew the code intimately enough to change it efficiently. Teams "own" code but the reality is that the tenure of an individual at the company is at best a few years. The situated context needs to be constructed anew no matter who is doing the work.
Individual teams would often try to page in the code migration shepherds to help with an incident. More often than not, the shepherds had more context because they at least knew half the problem: their own bits that needed to change. That was more than a team with no particular knowledge of its own code had.
Many individuals were dinged in performance review for not hitting their own team goals. A classic tale as old as time: the company, through indecisiveness, made an attempt to have its cake and eat it too:
Now of course this is poor leadership. Setting clear priorities and deciding what does and does not need to happen at a company level can only be done by leaders. Putting that aside, let's talk about how fan-out code migrations are a poor fit for a company that does performance reviews.
Performance reviews are conducted on the basis of:
Fan-out code migrations are the antithesis of consistent novelty for software engineers. Hundreds of engineers are doing the exact same work as you are. Since everyone does it, it's considered brain-dead gameplay (even when it might not be in any given situation) by the time performance reviews come around. At best your assessment for that stretch of work will be a nil-case "doesn't hurt, doesn't help" and "low risk, low reward." For individuals trying to get promoted the work is a temporary death sentence for their career.
So for individuals there's no incentive to do the work, certainly not do it well, and the poor individuals left holding the bag are penalized for it.
You or your team need to make a company-wide change. The correct approach is to do it yourselves.
Woah, I just said "correct" but every good answer is "it depends." Silly me, let me clarify into something more of a universal truth: directionally the code migrations a company carries out should progress towards self-run, non-fanout migrations to ensure:
And some code migrations don't need to happen.
Fan-out migrations are most appealing because they are a means to skirt staffing issues. Your team has something really important (in your eyes) but leadership, implicitly or explicitly, disagrees and does not staff the effort. Fan-out migrations are a means to sneak around that disagreement. Resolving the disagreement is a lot of work and, lacking psychological safety and good leaders, outright dangerous work to embark on to get that proper funding.
Most people in those situations would reasonably decide to pursue other options, because the work isn't actually so important to stake themselves on. (Hint: then it's not important.)
So with staffing unchanged and the work still to do, the next problem will be that your estimates show that if your team does the work it will take until the end of time. With only your team's manpower competing with the growth of the code, you'll never catch up. You need manpower and it has to come from somewhere. And thus the fan-out migration idea is formed.
If there's no prior art at your current company, there's art from your past or other companies. You do a trial balloon: you try the fan-out migration, most teams don't do the work, but a sliver do for whatever reason, and leadership either doesn't care or doesn't catch wind of it. This is now a strategy you and others can employ to skirt staffing issues. If not stomped out, these organization hacks will eventually snowball into the situation described above.
The important bits here are:
Being disingenuous is complicated: it is often a mixture of operating in a bubble and subconscious thinking that distorts urgency and importance. You need to press yourself really hard to justify this work to the business, and first to yourself. If you can't justify it to yourself and make it a priority for yourself, how can you expect others to work on it?
My team was asked to do a fairly large migration a while ago. I let them know this work wasn't important to us so we wouldn't be prioritizing it. They said because they were an understaffed team they couldn't do the work. I asked, "Well if this is important enough to ask us, this must be a top priority for you?" They quickly rebutted, "No, no, no. We've got much more important things to do: X, Y, Z..." Long story short, the solution was not to fan out work no one cared about. It was to follow the need for the migration back to its source and ask the team that would benefit most to put in the work, not everyone else.
If the work isn't important to you, it's even less important to anyone else.
Okay, now you've convinced yourself that this work is important to the company. Let's approach it from a clean slate and see what we can do with a fan-out migration off the table.
Like any project, the most important thing to figure out is the goal. There's not too much in particular to recommend here given how open-ended your goals might be. You definitely don't want to start at "we need a code migration" and work backwards.
A rough goal is fine to proceed into the next few sections because we can shape it a bit more by placing emphasis on data-driven, output-oriented decisions.
Every migration has a before and an after. We'll want to measure progress and know when the goal is reached.
Data we want for an initial assessment:
Utilization is a measure of importance. "If a tree falls in the woods and no one is around to hear it, does it make a sound?" You might have thousands of pieces of code to upgrade, but if they are dead code already it doesn't quite matter what you do.
Some concrete examples of count and utilization:
Assuming email types are distinctly defined in code, we can use source control history and greps to compute current count as well as velocity and acceleration of growth in email types.
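As a rough sketch of that computation: assume you have already produced a series of counts at a fixed interval, for example by checking out one commit per month and grepping for email type definitions. The function name `growth_stats` and the snapshot numbers below are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def growth_stats(counts: List[int]) -> Dict[str, int]:
    """Return current count, velocity, and acceleration from periodic snapshots.

    counts[i] is the number of distinct X (e.g. email types) at snapshot i,
    oldest first, with snapshots taken at a fixed interval (e.g. monthly).
    """
    if len(counts) < 3:
        raise ValueError("need at least three snapshots to compute acceleration")
    # Velocity: change in count between consecutive snapshots.
    velocity = [later - earlier for earlier, later in zip(counts, counts[1:])]
    # Acceleration: change in velocity between consecutive intervals.
    acceleration = [later - earlier for earlier, later in zip(velocity, velocity[1:])]
    return {
        "count": counts[-1],
        "velocity": velocity[-1],
        "acceleration": acceleration[-1],
    }


# Hypothetical monthly snapshots of the number of email types.
snapshots = [100, 110, 125, 145]
stats = growth_stats(snapshots)
print(stats)
```

Positive velocity and acceleration mean the problem is growing faster over time; negative values suggest it may be shrinking on its own.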
For the number of emails sent, we will need to consult databases, logs, and/or metrics to understand which email types are sent and what their trends are.
With this data in hand, we can start to talk about strategies.
For completeness: after looking at the data you might find negative velocity and acceleration on your hands. In that case the migration might be simple. Time will eliminate the problem or at worst make it more manageable, and if you can wait you may have to do much less than you anticipated to kill it off.
Let's now assume non-negative growth or that you can't wait. Assume X (e.g. an email type) is the subject of the code migration (often the concept), A is the current solution, B is the new solution. Some archetypes:
In terms of difficulty and cost: ratchet < partial < total. In terms of value, well that's where things get more complicated.
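To make the cheapest option concrete, here is a minimal sketch of a ratchet in the sense of "prevent new As": record the existing uses of A as a baseline, fail any change that adds a use outside that baseline, and shrink the baseline as uses are removed. The function names and email type names are hypothetical, and how you discover the current uses of A (grep, static analysis, a registry) is left as an assumption.

```python
from typing import Iterable, Set


def ratchet_violations(current_a_uses: Iterable[str], baseline: Set[str]) -> Set[str]:
    """Return uses of A that are not in the recorded baseline (i.e. new As)."""
    return set(current_a_uses) - baseline


def tightened_baseline(current_a_uses: Iterable[str], baseline: Set[str]) -> Set[str]:
    """Drop baseline entries that no longer use A so they cannot come back."""
    return baseline & set(current_a_uses)


# Hypothetical baseline recorded when the ratchet was introduced.
baseline = {"welcome_email", "receipt_email", "digest_email"}
# Hypothetical uses of A found in the code today.
current = ["welcome_email", "digest_email", "promo_email"]

violations = ratchet_violations(current, baseline)
new_baseline = tightened_baseline(current, baseline)
print(sorted(violations), sorted(new_baseline))
```

A check like this could run in CI and fail when `violations` is non-empty. The design choice is that the baseline only ever shrinks, which is what makes it a ratchet.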
At a company where the number of X is doubling every year, simply doing a ratchet to prevent new As will achieve a partial migration of 50% to Bs in just a year's time. This is the power of the "golden path": by making the easiest thing the best thing, time will be your greatest ally in enacting change.
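The arithmetic behind that claim can be sketched as follows, under simplifying assumptions that are mine rather than measured: the ratchet starts at year zero with everything on A, no existing A is removed, and every net new X lands on B. The function name `share_on_b` is hypothetical.

```python
def share_on_b(years: float, growth_factor: float = 2.0) -> float:
    """Fraction of all X that are on B after `years`, given a yearly growth factor.

    Assumes the count of A is frozen at its starting value (normalized to 1.0)
    and all growth beyond that is on B.
    """
    if growth_factor < 1.0:
        raise ValueError("this sketch only models non-negative growth")
    total = growth_factor ** years  # total X relative to the starting count
    return 1.0 - 1.0 / total


for years in (1, 2, 3):
    print(years, share_on_b(years))
```

With doubling, the share on B is 50% after one year, 75% after two, and 87.5% after three, without touching a single existing A.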
Partial migrations also have significant benefits when we factor in utilization. It's very rare to have a migration that when measured by utilization doesn't follow a Pareto distribution aka the 80-20 rule: "80% of outcomes are due to 20% of causes." There will be big fish to fry and the rest have increasingly diminishing returns.
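One way to find the big fish is to rank each X by utilization and take the smallest set that covers a target share of the total. The sketch below assumes you already have a utilization number per X (for example, emails sent per email type). The function name and the numbers are hypothetical.

```python
from typing import Dict, List


def smallest_set_covering(utilization: Dict[str, int], target_share: float) -> List[str]:
    """Return the most-utilized items that together reach target_share of the total."""
    total = sum(utilization.values())
    if total <= 0:
        raise ValueError("total utilization must be positive")
    chosen: List[str] = []
    covered = 0
    # Walk items from most to least utilized until the target is reached.
    for name, used in sorted(utilization.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target_share * total:
            break
        chosen.append(name)
        covered += used
    return chosen


# Hypothetical emails sent per email type over some period.
sends = {"receipt": 700, "welcome": 150, "digest": 100, "promo": 40, "survey": 10}
to_migrate = smallest_set_covering(sends, 0.80)
print(to_migrate)
```

In this made-up data, two of the five email types account for 85% of sends, so a partial migration of just those two would cover most of the output.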
Utilization is the more appropriate thing to strategize on because it is output-based. We don't care how many email types we have, we care that, for example, "emails are sent to the correct users" 99% of the time.
This is where we see with data-driven, output-based decisions that total migrations have little room to stand. Achieving 100% has two merits:
Baselines are awesome ideals, but diminishing returns on the utilization front make the ratchet and partial approaches much more appealing. After all, customers don't complain loudly about a 1% case that hasn't been migrated and that they rarely see.
Complexity can be a valid concern, but its reduction is often better fought for on different battlefields: those of security, reliability, and storage and compute cost. Achieving 100% isn't about the migration, it's about the inherent cost of keeping something around. Code's pretty cheap but data breaches, incidents, and servers can be quite costly. Those will be much more palatable reasons to kill something completely.
Total migrations are always a terrible idea from a project's outset. If you've decided 100% is the appropriate strategy, you're probably wrong and setting the project up for failure. Almost assuredly, the correct place to stop a code migration for diminishing returns is before 100%. If the ratchet and partial migrations work and a new goal presents itself to finally cut the remaining bits: great. Just don't jump into total migrations to start.
Let's contrast the downsides of running code migrations in the shape of "fan out work to teams, coordinate them to do it" to self-run migrations.
Self-run migrations inherently don't have this problem. For every other team, the migrations are all win. Someone is coming in and making improvements. Sounds lovely.
We can't escape the fact that there can be poor execution here, but it's isolated and easier to hold individuals accountable. There's no magic fix in doing self-run migrations for culture if the general sense is that poor execution is the norm. Self-run migrations will be more successful in general and can boost morale slightly, but engineering cultures place heavy emphasis on the negatives. If your goal is to improve culture alone, you'll want to chase the most negative negatives instead (unless code migrations are one).
A handful of engineers > handful of engineers loosely shepherding hundreds of engineers.
The main issue, disruption of team goals, is not present here. The people doing the work are the people who care about the work.
Nothing is taken away from teams due to this work so they have no reason to be defensive, only rightfully cautious of changes to their systems in general. In fact teams are even receptive to accepting ownership of slightly more things if the migrating team does the work and provides good documentation/support for those bits going forward.
There will be incidents. However we know the migration will be centrally managed. There will be multiple small incremental changes. There will be consistent and centrally managed rollback mechanisms. The issues while making changes will be learned immediately by the central team doing the rollout.
This is a much better situation than the dozens of independent incidents, tragic repeat issues, and churn of start/stops. As the migration plan evolves and is reworked due to incidents or new knowledge, the central team will be the only one adapting.
Self-run code migrations are the epitome of "consistent novelty." A small number of individuals are making a huge impact, not simply coordinating one. Here, we don't wrongly promote coordination or busywork. Most importantly, no one is given work that isn't valuable, that takes away from the work they should be doing, and that won't have an impact.
I did not talk about concrete engineering approaches in this article. We can't start to talk about approaches without a strong understanding of what to avoid. Having made the case for self-run migrations, we can next discuss efficient change management. That's the next problem, for another day.