Five case studies in what bad code actually costs
Most tech-debt writing is either catastrophising (Therac-25 is the only case study that gets cited) or hagiographic (Google's monorepo is the only success story that gets cited). A grown-up case-study section shows both, and shows the numbers either way.
The shared thread: every case is only legible in retrospect. The code-smell catalog exists so the retrospective does not arrive too late.
CASE STUDY 01
Knight Capital Group
1 August 2012 • NYSE trading • 45 minutes
TOTAL LOSS
$440M
On 1 August 2012, Knight Capital Group deployed new trading software for the NYSE's Retail Liquidity Program to its SMARS order router. The deployment repurposed a feature flag that had previously been used to enable a dormant code path called “Power Peg.” Power Peg had not been active since 2003 but had never been deleted.
During the deployment, the new code was copied to only seven of the eight SMARS servers. When the repurposed flag was activated at market open, the seven updated servers ran the intended new behaviour; the eighth, still carrying the old code, re-enabled Power Peg. Power Peg sent a continuous stream of child orders without tracking fills. In 45 minutes of NYSE trading, Knight accumulated positions it could not unwind, losing approximately $440 million. Knight agreed to be acquired by Getco four months later, effectively ending the firm's independent existence.
The SEC's administrative proceeding (Release No. 34-70694, October 2013) documents the failure in detail. The root causes are engineering-practice failures that map directly to the code-smell catalog:
- Dead Code - Power Peg was dormant for nine years and retained in the codebase. Had it been deleted when it was retired in 2003, the flag repurposing would have been harmless.
- Feature Flag hygiene - repurposing a retired flag for new behaviour without removing the old association. The flag's original semantics were still active in the dormant code path.
- Deployment discipline - no canary deployment, no automated rollback trigger, no health-check gate between servers. The inconsistency across eight servers could have been caught within seconds with a deployment health check.
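The third failure is the most mechanisable of the three. A gate as small as the sketch below, run before activating a repurposed flag, would have surfaced the one-server mismatch; the function, hostnames, and build strings here are invented for illustration, not Knight's actual tooling.

```python
def flag_activation_gate(server_builds, expected_build):
    """Refuse flag activation unless every server reports the expected build.

    Returns (ok, mismatched): `mismatched` maps hostname -> build for
    servers that disagree. All names here are illustrative.
    """
    mismatched = {host: build for host, build in server_builds.items()
                  if build != expected_build}
    return (not mismatched, mismatched)

# Seven of eight servers received the new build; one was missed.
fleet = {f"smars-{i}": "rlp-2012.08.01" for i in range(1, 8)}
fleet["smars-8"] = "legacy-2003"

ok, bad = flag_activation_gate(fleet, "rlp-2012.08.01")
# ok is False and bad names the stale server: do not set the flag.
```

A check like this costs seconds per deployment; the asymmetry against a nine-figure loss is the whole argument.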
Henrico Dolfing's 2019 analysis (“Case Study 4: Knight Capital Group,” henricodolfing.com) and Nanex's independent reconstruction of Knight's order flow corroborate the SEC's finding. The lesson is not that trading software is uniquely risky. It is that dormant code is expensive inventory. A code smell does not have to be “bad style” to cost four hundred and forty million dollars.
Sources: SEC Release No. 34-70694 (2013). Dolfing, H. (2019). Nanex (2012). Knight Capital Group (2012) CEO statement.
Related: Dead Code • incidentcost.com (incident cost calculator)
CASE STUDY 02
Therac-25 Radiation Therapy System
1985–1987 • Medical radiation • 6 accidents, 3 deaths
SEVERITY
Catastrophic
Between 1985 and 1987, the Therac-25 radiation therapy machine delivered massive radiation overdoses to at least six patients in North America. Three died. The root cause was a race condition in the software controlling the machine's treatment modes.
Leveson and Turner's landmark 1993 IEEE Computer paper (“An Investigation of the Therac-25 Accidents”) remains the canonical analysis. The failure mode: the Therac-25 reused software from the Therac-20, the previous model, but removed the hardware safety interlocks that the Therac-20 had relied on. The software carried an assumption (the hardware will prevent unsafe states) that was no longer true.
The specific race condition: a one-byte counter shared between two concurrent routines - the operator interface and the treatment delivery system - was incremented on every pass through a setup loop and wrapped to zero every 256 increments. A command issued at precisely that moment found the zero value, skipped a safety check, and left the machine in an unsafe state. This is the Inappropriate Intimacy smell at its most dangerous extreme: two concurrent routines sharing mutable state, coupled in a way that the original Therac-20 hardware interlocks had insulated.
The Therac-25 case is cited in software engineering education for a reason beyond its tragedy: it shows that smells in safety-critical systems accumulate a probability of harm that manifests only in rare execution sequences. The race condition required a specific sequence of operator keystrokes within a specific timing window, and was reproduced only after months of investigation.
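The wraparound mechanism fits in a few lines. The sketch below is an illustrative reconstruction, not Therac-25 source; the class and variable names are invented, but the shape - a counter wrap silently skipping a safety check - follows Leveson and Turner's description.

```python
class SetupLoop:
    """Illustrative model of a one-byte counter gating a safety check.

    The counter is incremented on every pass; a nonzero value triggers
    the check. When the counter wraps to zero, the check is skipped
    for that pass, so an unsafe state can be reported as safe.
    """

    def __init__(self, position_ok):
        self.counter = 0          # one-byte shared counter (illustrative)
        self.position_ok = position_ok
        self.checks_skipped = 0

    def tick(self):
        self.counter = (self.counter + 1) & 0xFF  # 8-bit wraparound
        if self.counter != 0:
            return self.position_ok()  # safety check runs
        self.checks_skipped += 1
        return True                    # check silently skipped

# Hardware is in an unsafe position for the entire run.
loop = SetupLoop(position_ok=lambda: False)
unsafe_passes_reported_ok = sum(1 for _ in range(512) if loop.tick())
# Only the two wraparound ticks report "ok" - exactly the passes
# where the check never ran.
```

Two passes in 512 is rare enough to survive testing and frequent enough to kill: the bug's probability is a property of the counter width, not of the code's apparent correctness.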
Sources: Leveson, N.G. and Turner, C.S. (1993). An Investigation of the Therac-25 Accidents. IEEE Computer, 26(7), 18-41. Leveson, N.G. (1995). Safeware: System Safety and Computers. Addison-Wesley.
Related: Inappropriate Intimacy • Bug rate correlations
CASE STUDY 03
Stripe's Testing Infrastructure Refactor
2020–2022 • Positive refactoring outcome
BUILD TIME REDUCTION
~30%
Stripe's engineering blog published multiple posts (2020-2022) documenting a multi-quarter refactor of its testing infrastructure. The goal was to reduce test flakiness and accelerate deployments ahead of high-traffic events like Cyber Monday. Published outcomes: median build time down by roughly 30%, rollback frequency down, and deployment confidence measurably up.
The most instructive artifact of the refactor is its framing. Stripe's engineering leadership described the investment as “build trust through speed” - a capital-allocation framing rather than a technical-hygiene framing. The work was justified on incident-risk reduction and deployment-frequency increase, both quantified, both tied to revenue outcomes.
The lesson for most engineering leaders: Stripe's public framing of the refactor is a rhetorical template. They did not go to leadership and say “our tests are flaky.” They said: “flaky tests are costing us N hours of engineer time per week, reducing our deployment confidence, and creating a risk surface ahead of our highest-revenue period. Here is the investment cost and here is the expected return.” That framing gets budget. “Our tests are flaky” does not.
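That pitch is arithmetic, and worth making concrete. Every number below is a hypothetical placeholder, not a Stripe figure; the point is the shape of the calculation you bring to a budget conversation.

```python
def refactor_case(engineers, hours_lost_per_eng_week, loaded_hourly_rate,
                  refactor_team, refactor_weeks, horizon_weeks=52):
    """Back-of-envelope net return for a test-infrastructure refactor.

    All inputs are placeholders to be replaced with your own data;
    assumes a 40-hour week for the refactor team.
    """
    weekly_waste = engineers * hours_lost_per_eng_week * loaded_hourly_rate
    investment = refactor_team * refactor_weeks * 40 * loaded_hourly_rate
    recovered = weekly_waste * horizon_weeks
    return {
        "weekly_waste": weekly_waste,
        "investment": investment,
        "net_over_horizon": recovered - investment,
    }

# 200 engineers each losing 2 hours/week to flaky tests, $120/hour
# loaded cost, versus a 4-person team spending 12 weeks on the fix.
case = refactor_case(engineers=200, hours_lost_per_eng_week=2,
                     loaded_hourly_rate=120, refactor_team=4,
                     refactor_weeks=12)
# weekly_waste: 48000, investment: 230400, net_over_horizon: 2265600
```

Even with deliberately conservative inputs, the waste line usually dwarfs the investment line; the rhetorical work is in gathering honest inputs, not in the multiplication.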
Sources: Stripe Engineering Blog (2020-2022). Patrick McKenzie (public commentary on Stripe's engineering culture, 2020-2022).
Related: Refactoring ROI memo • PR review time
CASE STUDY 04
Google's Monorepo and Trunk-Based Development
2016–ongoing • Structural prevention
Potvin and Levenberg's 2016 paper in Communications of the ACM (“Why Google Stores Billions of Lines of Code in a Single Repository”) is the canonical source on Google's monorepo approach. At the time of writing, the repository contained approximately 2 billion lines of code across 9 million source files, contributed to by 25,000 Google engineers.
The monorepo paired with trunk-based development creates a structural environment where many code smells are mechanically prevented. Long-lived feature branches - the primary incubator of Shotgun Surgery and Duplicate Code - are all but eliminated. Pre-submit checks enforce complexity gates, test-coverage requirements, and linting rules at commit time. Ownership is tracked at the directory level, creating accountability without fragmentation.
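The pre-submit pattern is easy to approximate outside Google. The sketch below is not Google's tooling; the thresholds, check names, and `change` schema are invented for illustration of the commit-time-gate idea.

```python
def presubmit(change):
    """Run a commit-time gate; return (ok, failed_check_names).

    `change` is a hypothetical dict of metrics computed for a pending
    commit; thresholds here are illustrative, not Google's values.
    """
    checks = {
        "test_coverage": change["coverage"] >= 0.80,
        "complexity": change["max_cyclomatic"] <= 10,
        "lint_clean": change["lint_errors"] == 0,
    }
    failed = [name for name, passed in checks.items() if not passed]
    return (not failed, failed)

ok, failed = presubmit({"coverage": 0.91, "max_cyclomatic": 14,
                        "lint_errors": 0})
# Blocks the commit: the complexity gate failed.
```

The structural point is where the gate runs: at commit time, on every change, so a smell is rejected before it exists in the repository rather than catalogued after.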
Winters, Manshreck, and Wright's Software Engineering at Google (O'Reilly 2020) provides the practitioner-level detail. Chapter 9 on code review documents a median PR review latency under 24 hours - a figure most organisations cannot approach with fragmented repo structures and long-lived branch policies.
The lesson: structural engineering decisions compound over decades. The cost of the alternative (a thousand small repositories with drift, inconsistent tooling, and cross-team coupling through external APIs) is invisible until a migration is attempted. Google's monorepo is not a universal recommendation; its tooling investment is enormous. It is a data point on what a deliberate, consistent structural decision can do when sustained over two decades.
Sources: Potvin, R. and Levenberg, J. (2016). Why Google Stores Billions of Lines of Code in a Single Repository. CACM 59(7). Winters, T., Manshreck, T. and Wright, H. (2020). Software Engineering at Google. O'Reilly.
Related: PR review time research • Detection tools • Software Engineering at Google (review)
CASE STUDY 05
Spotify's Engineering Culture Correction
2016–2022 • Organisational architecture failure
The “Spotify Model” (squads, tribes, guilds, chapters) was widely lionised as a template for scaling engineering culture. Between 2012 and 2016, it was a genuine Spotify innovation that helped scale from a startup to a global platform. Between 2016 and 2022, it was quietly abandoned by Spotify itself and by most organisations that had adopted it.
Jeremiah Lee's 2020 post (“Failed #SquadGoals,” jeremiahlee.com) documented the internal collapse. The model that scaled at 2015 Spotify did not scale at 2020 Spotify. Matrix reporting created accountability voids: squads without clear ownership, guilds without authority, and chapters without engineering leverage. A growing feature-team anti-pattern emerged - squads shipping features without cross-squad coordination, creating the equivalent of Shotgun Surgery at the organisational level.
The parallel to this site's argument is explicit: cultural architecture has the same structural half-life as code architecture. A model that is right for a 500-engineer company is not necessarily right for a 2,000-engineer company. Periodic honest retrospectives on your organisational structure are a maintenance tax, and they are cheaper than the accumulated cultural debt of ignoring drift. Spotify's correction, when it came, required significant reorganisation and leadership reframing.
The sister-publication angle: featurebloat.com covers the product-layer version of the same argument. The Spotify case is where the two publications meet: a feature-team anti-pattern in the organisational architecture produces the same effects as feature bloat at the product layer and code smells at the technical layer. The same diagnosis, at three zoom levels.
Sources: Lee, J. (2020). Failed #SquadGoals. jeremiahlee.com. Gibbon, C. (2022). Spotify's Engineering Culture retrospective. Spotify Engineering Blog (2012-2014, original model documentation).
“The shared thread is that every case is legible only in retrospect. Knight Capital's Power Peg looked harmless on 31 July. Therac-25's mode counter looked clean on Therac-20. The code-smell catalog exists so the retrospective does not arrive too late.”