Dealing with Technical Debt: A Radical Solution
Debt is an agreement whereby we receive a good today in exchange for future payments. In recent years, the term "technical debt" has been widely adopted to describe a similar arrangement in software development.
The good we receive? Producing deployable software. The future payments? Greater difficulty in maintaining that software.
Deploying (or shipping) software is always a tradeoff. As such, we always accumulate technical debt. The problem is that we often misunderstand just how corrosive to a system technical debt is -- how much those "low, easy payments" are going to be and to what degree they will cripple new development.
Technical debt doesn't just accumulate: each new bit of debt increases the instability of a system in a non-linear fashion. At some point, technical debt achieves a sort of "critical mass" and the system becomes chaotic. At that point, even small changes to the system become major occasions of uncertainty.
When I receive a call from a frantic manager along these lines: "Any change we make breaks something else", I know I'm hearing the symptoms of technical debt that has "gone critical".
Just as in areas of financial debt, the best strategy is to deal with debt before it becomes critical. There are a number of effective technical debt mediation strategies, including refactoring, implementing test-driven development, continuous integration, adherence to coding standards, peer review, etc. But for the distraught shop trying to deal with runaway debt, these strategies may not be sufficient.
In such situations, one possible option may be bankruptcy. Lest you think I may be stretching the financial analogy to the breaking point, let me describe one such situation.
The flagship application of ABC, Inc. (not, of course, their real name) was very successful and had produced profits for several years. The first indication that something was wrong was that the pace of development slowed: it took longer to add new features, as more of the developers' time was spent squashing what one developer described as "rolling bugs".
ABC's management did what seemed to be the prudent thing: they hired more developers to relieve pressure within the system. Oddly, though, the rate at which new features could be developed and deployed slowed even more.
Part of this was due to the well-understood paradox described in The Mythical Man-Month: in a complex system, the only ones capable of getting new developers up to speed are the most senior developers -- the very ones needed to deal with the system's mammoth complexity when introducing new features.
Management was confounded. It was at this point I became involved. The system itself was very complex, consisting of thousands of individual code files and a database that had been tinkered with endlessly. Program logic was spread (and duplicated) across code, stored procedures, and triggers. Literally no one had a full understanding of the system.
When things get to this level, other problems surface. One of them is a breakdown in communication between management and developers. That was the case with ABC. Management could not understand what had happened with their once-excellent developers. Meetings multiplied as management attempted to understand the problem. Unfortunately, their misunderstanding of technical debt led them to conclude that the developers just needed to "try harder". They applied uneven incentives that served only to fracture an already-fragile team.
But the developers were trying plenty hard. They felt the crushing stress of trying to deploy bug-free code, knowing that the odds for success were exceedingly low. It wasn't, as I explained to their manager, that the developers were writing buggy code; it was that the interactions within the system were so complex that no one could safely predict what areas of code a seemingly simple change might affect.
We were well past the point where incremental changes could reverse the descent into chaos. My suggestion: declare bankruptcy -- not financial bankruptcy, but technical bankruptcy.
Concretely, I proposed creating a "skunkworks" project that would gradually replace the old, unstable code with newly minted, well-structured code.
The developers were essential in determining the areas of the code that could be "modularized". "Starting over" gave the developers a much-needed sense of hope as they could imagine a not-too-distant future in which they could be successful. It gave managers a sense of relief that they could react more quickly to requests for new features.
Now, "starting over" is not a new strategy. When things appear to reach a breaking point, starting over may be very appealing. But unmanaged, such a decision often only delays the day of reckoning. For the strategy to succeed, I've found the following characteristics must be present:
- The skunkworks team needs to be limited in number. This can be difficult as most developers will gladly trade in the day-to-day stress of an unmanageable system to work on a new one. (I'll discuss one strategy for dealing with this below.)
- The scope of work for the skunkworks team needs to be well-defined. "Build us a new system" is not such a definition. Instead, finding candidates for modularization is essential. Once the appropriate module is determined, both management and the development team need to ensure that scope creep does not occur. In particular, management must make heroic efforts not to enlist the skunkworks team in "fighting fires."
- The daily upkeep of the existing system must go on. Taking a several-month hiatus to completely retool the entire system will be deeply appealing. In every case I'm aware of, it's been a catastrophic mistake. The incremental approach, combined with implementing new debt-mediation strategies as part of a new structured process, is far more likely to achieve long-lasting success.
- The team makeup must change over time so that all developers are able to become "skunkworkers". Without this, two groups will develop: the elite skunkworks team and the schlubs. It is absolutely essential to prevent this. Breaking the application into separate components (with well-defined responsibilities and a corresponding API) can help determine which members will be enlisted as skunkworkers for that particular module.
- All members of the development team must be made to feel part of the new effort. I love the idea of "learning lunches": the company furnishes lunch and developers spend their time together learning something new. That "something new" can be a discussion of the new architecture, the new technologies that will implement it, and the new processes the company will employ. It's important that those fortunate enough to be current members of the skunkworks team take this responsibility seriously. All members of the team are making real sacrifices to make the skunkworks viable and their efforts should be respected and rewarded.
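To make the module-by-module idea a little more concrete, here is a minimal Python sketch of one way a team might put a thin routing layer in front of a component, so that the legacy code keeps running until its replacement is ready. This is an illustration under my own assumptions, not a description of ABC's system: `ModuleRouter`, `cut_over`, `roll_back`, and the "billing" module are hypothetical names, and the rollback step is an added assumption rather than something discussed above.

```python
from typing import Any, Callable, Dict, Set

Handler = Callable[..., Any]


class ModuleRouter:
    """Hypothetical facade: routes each named module to legacy or new code."""

    def __init__(self) -> None:
        self._legacy: Dict[str, Handler] = {}
        self._replacement: Dict[str, Handler] = {}
        self._migrated: Set[str] = set()

    def register_legacy(self, name: str, handler: Handler) -> None:
        # The existing implementation that must keep running day to day.
        self._legacy[name] = handler

    def register_replacement(self, name: str, handler: Handler) -> None:
        # The skunkworks implementation; registering it does not activate it.
        self._replacement[name] = handler

    def cut_over(self, name: str) -> None:
        # Switch a single module to the new code once it is ready.
        if name not in self._replacement:
            raise KeyError(f"no replacement registered for module {name!r}")
        self._migrated.add(name)

    def roll_back(self, name: str) -> None:
        # Assumed safety valve: return a module to the legacy code.
        self._migrated.discard(name)

    def call(self, name: str, *args: Any, **kwargs: Any) -> Any:
        # Callers use one stable entry point regardless of migration state.
        if name in self._migrated:
            return self._replacement[name](*args, **kwargs)
        return self._legacy[name](*args, **kwargs)


if __name__ == "__main__":
    router = ModuleRouter()
    router.register_legacy("billing", lambda amount: ("legacy", amount))
    router.register_replacement("billing", lambda amount: ("new", amount))

    print(router.call("billing", 100))  # served by the legacy handler
    router.cut_over("billing")
    print(router.call("billing", 100))  # served by the replacement handler
```

The point of a layer like this is that each module can be rolled out on its own schedule while the rest of the system carries on unchanged, which is what keeps the scope of any one skunkworks effort small.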
In ABC's case, the complete migration took just under a year. Too long a time? Too great an investment? Those were certainly the concerns of management before they made the decision to proceed. After the fact, I spoke with them. Here's what one of their senior managers said:
We didn't like the idea at first. But we had a meeting in which we agreed to try to face the truth about the situation. We had tried everything we knew, so we were really at the point where we didn't see any other alternatives.
During this last year, there were a lot of worried managers. But every time one of those sections got rolled out, it increased everyone's confidence and made it that much easier to believe that we could, finally, fix the situation.
And the tension between management and developers?
In one of our meetings, I told both our managers and developers alike a quote attributed to Ben Franklin: "We must all hang together or we shall surely all hang separately." Everybody laughed -- and everybody bought into the idea that we were all in this together.
And the outlook for the future?
Before anything else, we wanted to make sure we were never in this situation again. We realized that we had made the mistake of letting non-technical people make technical decisions. Really, that was an abuse of power. We didn't think of it that way at the time decisions were getting made, of course. Anyway, we take dealing with technical debt really seriously this time around. We listen to the developers more. Sometimes, they even listen to us!
The hardest part of the transition?
Like I said, there were some tense moments where we realized we were sort of betting our future on this project. And for a long time, there wasn't much feedback. One of the managers asked the skunkworks team if he could attend the learning lunches. He had to promise not to make any "helpful suggestions", but that turned out to be a great idea. For a few months, we couldn't see anything concrete, but those learning lunches helped sustain us with the belief that real progress was being made.
Advice for others?
Take technical debt seriously! I think we were really fortunate that we were able to salvage this situation. I can think of a lot of situations where the results might not be as good. Anyway, it's much better to deal with these things as part of a process than to let them get to the point where you have to undergo a radical change.


All new features are written adhering to the agreed design standard and sizable chunks of the old system are being refactored into new shiny versions, too.
Old bugs are still being quashed in the legacy code, using "smarter" fixes that allow for an easier (eventual) upgrade to the current v2 format.
About the only thing we didn't give thought to (and which was mentioned in the linked Martin Fowler post) is how to make the new code easy to replace in the future.
We simply saw V2 as being the (eventual) end-point.
Gavin.