Duplicate Code: why copy-paste is the most expensive shortcut you will ever take

Annual cost for a team of 8: $18,000 – $95,000, scaled by duplication density.

§ 01

Fowler's “Number One in the Stink Parade”

The first edition of Refactoring (ch. 3) opens its smell catalog with Duplicated Code and introduces it as “number one in the stink parade”. The second edition (ch. 3) lists Mysterious Name first but keeps the diagnosis: “If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them.” The top ranking fits because the smell is both common and directly tractable: duplication has a deterministic fix.

Spinellis' 2006 analysis of production open-source codebases estimated that 30-50% of code is duplicated at some level of abstraction. In proprietary enterprise codebases, the figure is typically higher because there is less peer review pressure to catch early duplication.

§ 02

Two Kinds of Duplication

Textual duplication is the easy case: the same code block copy-pasted. PMD's CPD (Copy-Paste Detector) finds it automatically. The cost is Shotgun Surgery: fix a bug in one copy, remember to fix all the others. The ones you forget are the defects.

Semantic duplication is harder: two implementations of the same concept, written differently by two engineers who did not know about each other. A discount calculator in OrderService and a discount calculator in InvoiceService, implemented slightly differently. One handles edge cases the other does not. Over time they diverge further.

// Textual duplication: one function copy-pasted into three files
// OrderService.ts (the original; PLATINUM was added here last month)
function applyDiscount(price: number, tier: string): number {
  if (tier === 'PLATINUM') return price * 0.85; // illustrative rate
  if (tier === 'GOLD') return price * 0.9;
  if (tier === 'SILVER') return price * 0.95;
  return price;
}

// InvoiceService.ts (exact copy from 8 months ago, so no PLATINUM)
function applyDiscount(price: number, tier: string): number {
  if (tier === 'GOLD') return price * 0.9;
  if (tier === 'SILVER') return price * 0.95;
  return price;
}

// ReportService.ts (same stale copy, PLATINUM tier missing)
function applyDiscount(price: number, tier: string): number {
  if (tier === 'GOLD') return price * 0.9;
  if (tier === 'SILVER') return price * 0.95;
  // PLATINUM was added to OrderService last month.
  // ReportService still uses the old calculation.
  return price;
}
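
Semantic duplication rarely looks like a copy. The following is a minimal sketch, with hypothetical function names and an assumed edge case (negative prices), of two independently written versions of the same discount concept. A token-based detector is unlikely to match them, yet they encode the same rule and have already drifted apart.

```typescript
// Hypothetical sketch: names, tiers, and the negative-price edge case
// are illustrative assumptions, not taken from a real codebase.

// OrderService.ts: if-chain, guards against non-positive prices
function orderDiscount(price: number, tier: string): number {
  if (price <= 0) return 0;
  if (tier === 'GOLD') return price * 0.9;
  if (tier === 'SILVER') return price * 0.95;
  return price;
}

// InvoiceService.ts: lookup table, no guard
const INVOICE_MULTIPLIERS: Record<string, number> = { GOLD: 0.9, SILVER: 0.95 };

function invoiceTotal(amount: number, customerTier: string): number {
  const multiplier = INVOICE_MULTIPLIERS[customerTier] ?? 1;
  return amount * multiplier;
}

// The two agree on the happy path and disagree on the edge case.
const happyPathAgrees = orderDiscount(200, 'GOLD') === invoiceTotal(200, 'GOLD');
const edgeCaseAgrees = orderDiscount(-50, 'GOLD') === invoiceTotal(-50, 'GOLD');
console.log({ happyPathAgrees, edgeCaseAgrees });
```

The disagreement only shows up when both functions are fed the same edge-case input, which is why this kind of duplication tends to surface in production rather than in review.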

§ 03

The Cost Mechanism

Shotgun Surgery liability

One change requires N edits, where N is the number of copies. Engineering teams miss the edit in at least one location approximately 20-30% of the time, based on incident post-mortem patterns. Each missed edit is a latent defect.

Bug propagation

A fix applied to one copy reaches the others only if the engineer knows they exist. CodeScene's temporal coupling analysis shows that copy-paste clusters frequently have non-overlapping ownership: different engineers own each copy, and the bug-fix PR does not include all of them.

Test-suite bloat

Each copy must be tested independently, or the test suite is incomplete. A 30-50% duplication rate in a codebase implies either similar bloat in the test suite or equivalent gaps in coverage.
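
A minimal sketch of what "tested independently" means in practice. It assumes two hypothetical copies of the discount rule (simulated with a small factory to keep the example short) and illustrative rates. The same case table has to run against every copy, so the number of checks grows with the number of copies, and the stale copy is the one that fails.

```typescript
// Hypothetical sketch: copy names, rates, and cases are illustrative.
type DiscountFn = (priceCents: number, tier: string) => number;

// Percent of the price still owed per tier, as each copy currently knows it.
const ORDER_RATES: Record<string, number> = { PLATINUM: 85, GOLD: 90, SILVER: 95 };
const REPORT_RATES: Record<string, number> = { GOLD: 90, SILVER: 95 }; // stale copy

const makeDiscount = (rates: Record<string, number>): DiscountFn =>
  (priceCents, tier) => Math.round((priceCents * (rates[tier] ?? 100)) / 100);

const copies: Record<string, DiscountFn> = {
  OrderService: makeDiscount(ORDER_RATES),
  ReportService: makeDiscount(REPORT_RATES),
};

// One case table: [price in cents, tier, expected result in cents].
const cases: Array<[number, string, number]> = [
  [10000, 'GOLD', 9000],
  [10000, 'PLATINUM', 8500],
  [10000, 'BRONZE', 10000],
];

// Every case runs against every copy: copies x cases checks in total.
function failingCopies(): string[] {
  return Object.entries(copies)
    .filter(([, fn]) => cases.some(([price, tier, expected]) => fn(price, tier) !== expected))
    .map(([name]) => name);
}

console.log(failingCopies());
```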

Review confusion

A reviewer who sees the same logic twice either asks 'why is this duplicated?' (adding review latency) or does not notice (adding defect risk). Neither is a good outcome.

§ 04

The DRY Counterpoint

This is where the field gets nuanced, and where intellectual honesty matters more than orthodoxy.

Hunt and Thomas' original DRY principle in The Pragmatic Programmer (1999; 20th anniversary ed. 2019) was about knowledge, not literal code: “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.” Two functions that calculate the same discount are two copies of one piece of knowledge, and DRY says to unify them. Two functions that happen to share implementation details but represent different business concepts are not duplicates in this sense.

Sandi Metz' 2016 essay “The Wrong Abstraction” (sandimetz.com) argues: “duplication is far cheaper than the wrong abstraction.” When two call sites look identical today but are evolving toward divergent behaviour, extracting them into a shared function creates a coupling that will eventually require a workaround parameter or a fork.

John Ousterhout in A Philosophy of Software Design (2018): some duplication is preferable to an over-general abstraction. A well-placed duplicate is cheaper to understand than a general function with a boolean flag that switches between two behaviours.

The working heuristic: if two blocks of code are copies of the same knowledge (same rules, same invariants, same domain concept), unify them. If they happen to look similar today but represent different concepts evolving independently, leave them separate and document why.
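
The heuristic can be sketched as follows. All names and rates here are hypothetical: a loyalty discount that two services share (same knowledge, so one constant) and an early-payment rebate that happens to have the same value today but belongs to a different policy (different knowledge, so its own constant).

```typescript
// Hypothetical sketch: names and rates are illustrative assumptions.

// Shared mechanism: plain arithmetic, safe to reuse anywhere.
function discountedCents(priceCents: number, percentOff: number): number {
  return Math.round((priceCents * (100 - percentOff)) / 100);
}

// Same knowledge: "GOLD customers get 10% off" is one rule, so it lives in one place
// and both call sites read it from there.
const GOLD_DISCOUNT_PERCENT = 10;
const orderTotal = (priceCents: number): number =>
  discountedCents(priceCents, GOLD_DISCOUNT_PERCENT);
const invoiceTotal = (priceCents: number): number =>
  discountedCents(priceCents, GOLD_DISCOUNT_PERCENT);

// Different knowledge: the early-payment rebate is also 10% today, but it is owned
// by a different policy. It keeps its own constant so that changing one rule
// cannot silently change the other.
const EARLY_PAYMENT_REBATE_PERCENT = 10;
const earlyPaymentTotal = (priceCents: number): number =>
  discountedCents(priceCents, EARLY_PAYMENT_REBATE_PERCENT);

console.log(orderTotal(20000), invoiceTotal(20000), earlyPaymentTotal(20000));
```

Merging the two constants because they are equal today would be the coupling Metz warns about: the first time one policy changes, the shared value needs a flag or a fork.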

§ 05

The Refactoring Sequence

// Step 0: Detect before you refactor. Run PMD CPD
// (PMD 7 CLI shown; add --language for non-Java sources)
// pmd cpd --dir src --minimum-tokens 100

// Step 1: Extract Method for duplication within a class
// (Fowler ch. 6; named Extract Function in the 2nd ed.)
// - move the repeated block into one method and call it from each site

// Step 2: Extract Class for shared logic between classes
// (Fowler ch. 7)
// - create DiscountCalculator
// - inject or import it in OrderService and InvoiceService
// - delete the duplicate implementations

// Step 3: Pull Up Method for duplication in sibling subclasses
// (Fowler ch. 12)
// - move the shared method to the common superclass
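
A minimal sketch of the Extract Class step, assuming integer cents for prices and an illustrative PLATINUM rate. The method names on the two services are hypothetical. The point is only that the tier table exists once and both services delegate to it.

```typescript
// Hypothetical sketch of Extract Class: method names are illustrative,
// and the PLATINUM rate is an assumed value.
type Tier = 'PLATINUM' | 'GOLD' | 'SILVER' | 'STANDARD';

class DiscountCalculator {
  // Percent of the price still owed per tier: the single source of truth.
  private static readonly PERCENT_OWED: Record<Tier, number> = {
    PLATINUM: 85,
    GOLD: 90,
    SILVER: 95,
    STANDARD: 100,
  };

  apply(priceCents: number, tier: Tier): number {
    return Math.round((priceCents * DiscountCalculator.PERCENT_OWED[tier]) / 100);
  }
}

class OrderService {
  private readonly discounts: DiscountCalculator;
  constructor(discounts: DiscountCalculator) {
    this.discounts = discounts;
  }
  total(priceCents: number, tier: Tier): number {
    return this.discounts.apply(priceCents, tier);
  }
}

class InvoiceService {
  private readonly discounts: DiscountCalculator;
  constructor(discounts: DiscountCalculator) {
    this.discounts = discounts;
  }
  amountDue(priceCents: number, tier: Tier): number {
    return this.discounts.apply(priceCents, tier);
  }
}

const calculator = new DiscountCalculator();
const orders = new OrderService(calculator);
const invoices = new InvoiceService(calculator);
console.log(orders.total(10000, 'PLATINUM'), invoices.amountDue(10000, 'PLATINUM'));
```

Typing the tier as a union rather than a string is a design choice of this sketch: adding a tier then forces the table to be updated, because the `Record<Tier, number>` no longer type-checks until it is.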

§ 06

Detection Tools

Tool	Approach	Best for
PMD CPD	Token-based clone detection	Cross-file, cross-language. Best-in-class for Java.
SonarCloud	Duplication percentage by file and by module	CI gate - fails build if duplication exceeds threshold.
CodeScene	Copy-paste line count + temporal coupling	Identifying which duplicates are co-evolving (likely same knowledge).
JetBrains IDEs	Built-in duplicates inspection	Developer-time detection during active editing.
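
To make "token-based clone detection" concrete, here is a deliberately simplified sketch of the general idea: split each file into tokens, slide a fixed-size window over them, and report files that share a window. It is not how PMD CPD or any tool in the table is implemented. The regex tokenizer, the function names, and the sample inputs are all assumptions for illustration.

```typescript
// Hypothetical, simplified sketch of token-window clone detection.
// Real tools use language-aware lexers and smarter matching.

function tokenize(source: string): string[] {
  // Identifiers and numbers become one token each; any other
  // non-whitespace character becomes its own token.
  return source.match(/[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|\S/g) ?? [];
}

function findClones(files: Record<string, string>, minTokens: number): string[] {
  const firstSeenIn = new Map<string, string>(); // token window -> first file containing it
  const pairs = new Set<string>();

  for (const [name, source] of Object.entries(files)) {
    const tokens = tokenize(source);
    for (let i = 0; i + minTokens <= tokens.length; i++) {
      const window = tokens.slice(i, i + minTokens).join(' ');
      const first = firstSeenIn.get(window);
      if (first === undefined) {
        firstSeenIn.set(window, name);
      } else if (first !== name) {
        // Each later file is paired with the first file that had the window.
        pairs.add(`${first} <-> ${name}`);
      }
    }
  }
  return Array.from(pairs);
}

const files = {
  'OrderService.ts': "if (tier === 'GOLD') return price * 0.9; return price;",
  'InvoiceService.ts': "if (tier === 'GOLD') return price * 0.9; return price;",
  'Unrelated.ts': "export const name = 'report';",
};

console.log(findClones(files, 10));
```

The `minTokens` parameter plays the same role as a minimum-token threshold in a real detector: raise it and short coincidental matches disappear, lower it and the report fills with noise.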

§ 07

When Duplication is OK

Prototype code being actively iterated. You do not know the final abstraction yet. Do not extract until the shape is clear.
Code scheduled for deletion. Refactoring a module that will be deleted next sprint is capital destruction.
Two call sites evolving toward divergence. If you know they will diverge, do not couple them. Document the intended divergence instead.

The discipline: annotate the duplication. A comment that says “intentional duplicate of X because Y” is not a smell; it is a decision record. The smell is unacknowledged duplication.
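
One possible shape for such an annotation, as a sketch. The function names, the stated reason, and the revisit condition are hypothetical. What matters is that the comment names the other copy, gives the reason, and says when to reconsider.

```typescript
// Hypothetical sketch: names and the stated reason are illustrative.

function orderDiscountCents(priceCents: number): number {
  return Math.round((priceCents * 90) / 100);
}

// INTENTIONAL DUPLICATE of orderDiscountCents (OrderService).
// Reason: partner pricing is negotiated separately and is expected to diverge.
// Revisit: if the two are still identical once partner pricing is finalised.
function partnerDiscountCents(priceCents: number): number {
  return Math.round((priceCents * 90) / 100);
}

console.log(orderDiscountCents(10000), partnerDiscountCents(10000));
```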
