Duplicate Code: why copy-paste is the most expensive shortcut you will ever take
Fowler places Duplicated Code near the top of the smell catalog: “If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them” (Refactoring 2nd ed., ch. 3). In the first edition he called it “number one in the stink parade.” The smell is both common and directly tractable: once a duplicate is identified, the fix is a well-defined refactoring.
Spinellis' 2006 analysis of production open-source codebases estimated that 30-50% of code is duplicated at some level of abstraction. In proprietary enterprise codebases, the figure is typically higher because there is less peer review pressure to catch early duplication.
Textual duplication is the easy case: the same code block copy-pasted. PMD's CPD (Copy-Paste Detector) finds it automatically. The cost is Shotgun Surgery: fix a bug in one copy, remember to fix all the others. The ones you forget are the defects.
Semantic duplication is harder: two implementations of the same concept, written differently by two engineers who did not know about each other. A discount calculator in OrderService and a discount calculator in InvoiceService, implemented slightly differently. One handles edge cases the other does not. Over time they diverge further.
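A minimal sketch of what semantic duplication can look like, assuming two hypothetical functions (`orderDiscountedPrice` and `invoiceTotal`) written independently. The names, rates, and the guard on non-positive prices are illustrative assumptions, not taken from a real codebase. A token-based detector would probably not pair them, because the text differs even though the knowledge is the same.

```typescript
// Hypothetical names and rates, for illustration only.
type Tier = 'GOLD' | 'SILVER' | 'NONE';

// OrderService's version: a conditional chain on the multiplier.
function orderDiscountedPrice(price: number, tier: Tier): number {
  if (price <= 0) return 0; // edge case handled here only
  if (tier === 'GOLD') return price * 0.9;
  if (tier === 'SILVER') return price * 0.95;
  return price;
}

// InvoiceService's version: a lookup table of percentages off.
const INVOICE_DISCOUNT_PERCENT: Record<Tier, number> = {
  GOLD: 10,
  SILVER: 5,
  NONE: 0,
};

function invoiceTotal(amount: number, tier: Tier): number {
  // No guard: a negative amount passes straight through.
  return amount - (amount * INVOICE_DISCOUNT_PERCENT[tier]) / 100;
}

// Same rule on the common path, different behaviour at the edge.
console.log(orderDiscountedPrice(-50, 'GOLD')); // 0
console.log(invoiceTotal(-50, 'GOLD')); // -45
```

Both functions encode the same tier rules, so a rate change has to be made twice, in two different shapes.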
// Textual duplication: the same function copy-pasted into three files
// OrderService.ts
function applyDiscount(price: number, tier: string): number {
  if (tier === 'PLATINUM') return price * 0.85; // added last month (rate is illustrative)
  if (tier === 'GOLD') return price * 0.9;
  if (tier === 'SILVER') return price * 0.95;
  return price;
}
// InvoiceService.ts (exact copy from 8 months ago, so no PLATINUM either)
function applyDiscount(price: number, tier: string): number {
  if (tier === 'GOLD') return price * 0.9;
  if (tier === 'SILVER') return price * 0.95;
  return price;
}
// ReportService.ts (copy, but PLATINUM tier missing)
function applyDiscount(price: number, tier: string): number {
  if (tier === 'GOLD') return price * 0.9;
  if (tier === 'SILVER') return price * 0.95;
  // PLATINUM was added to OrderService last month.
  // ReportService still uses the old calculation.
  return price;
}
Shotgun Surgery liability
One change requires N edits, where N is the number of copies. Engineering teams miss the edit in at least one location approximately 20-30% of the time, based on incident post-mortem patterns. Each missed edit is a latent defect.
Bug propagation
A bug fixed in one copy propagates to the other copies only if the engineer knows about them. CodeScene's temporal coupling analysis shows that copy-paste clusters frequently have non-overlapping ownership - different engineers own each copy, and the bug-fix PR does not include all of them.
Test-suite bloat
Each copy must be tested independently, or the test suite is incomplete. A 30-50% duplication rate in a codebase implies a similar bloat in the test suite - or equivalent gaps in coverage.
Review confusion
A reviewer who sees the same logic twice either asks 'why is this duplicated?' (adding review latency) or does not notice (adding defect risk). Neither is a good outcome.
This is where the field gets nuanced, and where intellectual honesty matters more than orthodoxy.
Hunt and Thomas' original DRY principle in The Pragmatic Programmer (1999; 20th anniversary ed. 2019) was about knowledge, not literal code: “Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.” Two functions that calculate the same discount are two copies of one piece of knowledge, and DRY says to unify them. Two functions that happen to share implementation details but represent different business concepts are not duplicated knowledge, and DRY does not apply to them.
Sandi Metz' 2016 essay “The Wrong Abstraction” (sandimetz.com) argues: “duplication is far cheaper than the wrong abstraction.” When two call sites look identical today but are evolving toward divergent behaviour, extracting them into a shared function creates a coupling that will eventually require a workaround parameter or a fork.
John Ousterhout in A Philosophy of Software Design (2018): some duplication is preferable to an over-general abstraction. A well-placed duplicate is cheaper to understand than a general function with a boolean flag that switches between two behaviours.
The working heuristic: if two blocks of code are copies of the same knowledge (same rules, same invariants, same domain concept), unify them. If they happen to look similar today but represent different concepts evolving independently, leave them separate and document why.
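The trade-off can be sketched with two hypothetical concepts, a loyalty discount and a late-payment rebate, that shared the same arithmetic when first written. The domain, the function names, and the minimum-charge rule are assumptions made up for this illustration.

```typescript
// Over-general version: one function, with a flag that exists only
// because the two concepts stopped being identical.
function adjust(amount: number, rate: number, isRebate: boolean): number {
  const reduced = amount * (1 - rate);
  return isRebate ? Math.max(reduced, 10) : reduced;
}

// Separate versions: each states its own rule and can change alone.
function applyLoyaltyDiscount(price: number, rate: number): number {
  return price * (1 - rate);
}

// Intentionally similar to applyLoyaltyDiscount, but rebates never
// reduce a balance below a minimum charge (assumed business rule).
function applyLatePaymentRebate(balance: number, rate: number): number {
  const MINIMUM_CHARGE = 10;
  return Math.max(balance * (1 - rate), MINIMUM_CHARGE);
}

console.log(applyLoyaltyDiscount(100, 0.5)); // 50
console.log(applyLatePaymentRebate(10, 0.5)); // 10
```

A reader of `adjust` has to learn both concepts to change either one. The separate functions cost a repeated line of arithmetic and keep each rule readable on its own.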
// Step 1: Extract Method for duplication within a class
//         (Fowler ch. 6; named Extract Function in the 2nd ed.)
// Step 2: Extract Class for shared logic between classes
//   - create DiscountCalculator
//   - inject or import it in OrderService and InvoiceService
//   - delete the duplicate implementations
// Step 3: Pull Up Method for duplication in sibling subclasses
//         (Fowler ch. 12)
//   - move the shared method to the common superclass
//
// Detection first: run PMD CPD before you refactor
//   pmd cpd --minimum-tokens 100 --dir src
//   (PMD 7 CLI; CPD assumes Java unless --language is given)
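Step 2 above might look roughly like the following. This is a sketch, not a prescribed design: the `orderTotal` and `invoiceAmount` method names are hypothetical, the GOLD and SILVER rates come from the earlier example, and the PLATINUM rate is an assumed placeholder.

```typescript
// Single home for the discount rules.
const DISCOUNT_MULTIPLIERS = new Map<string, number>([
  ['PLATINUM', 0.85], // assumed rate, for illustration
  ['GOLD', 0.9],
  ['SILVER', 0.95],
]);

class DiscountCalculator {
  apply(price: number, tier: string): number {
    // Unknown tiers get no discount.
    const multiplier = DISCOUNT_MULTIPLIERS.get(tier) ?? 1;
    return price * multiplier;
  }
}

class OrderService {
  private readonly discounts: DiscountCalculator;

  constructor(discounts: DiscountCalculator) {
    this.discounts = discounts;
  }

  orderTotal(price: number, tier: string): number {
    return this.discounts.apply(price, tier);
  }
}

class InvoiceService {
  private readonly discounts: DiscountCalculator;

  constructor(discounts: DiscountCalculator) {
    this.discounts = discounts;
  }

  invoiceAmount(price: number, tier: string): number {
    return this.discounts.apply(price, tier);
  }
}

// Both services share one calculator, so a new tier is added once.
const calculator = new DiscountCalculator();
const orders = new OrderService(calculator);
const invoices = new InvoiceService(calculator);
console.log(orders.orderTotal(100, 'PLATINUM') === invoices.invoiceAmount(100, 'PLATINUM')); // true
```

Injecting the calculator through the constructor is one option; importing a plain function would remove the duplication just as well.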
| Tool | Approach | Best for |
|---|---|---|
| PMD CPD | Token-based clone detection | Cross-file detection, with support for many languages. Best-in-class for Java. |
| SonarCloud | Duplication percentage by file and by module | CI gate - fails build if duplication exceeds threshold. |
| CodeScene | Copy-paste line count + temporal coupling | Identifying which duplicates are co-evolving (likely same knowledge). |
| JetBrains IDEs | Built-in duplicates inspection | Developer-time detection during active editing. |
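To give a feel for what clone detection does, here is a deliberately crude sketch. It compares windows of normalized lines rather than token streams, so it is a stand-in for the idea and not how CPD or any tool in the table is implemented. `SourceFile`, `Clone`, `normalize`, and `findClones` are hypothetical names.

```typescript
interface SourceFile {
  name: string;
  text: string;
}

interface Clone {
  first: string; // "file:line" where the window was first seen
  second: string; // "file:line" of the repeat
}

// Strip trailing // comments and collapse whitespace, so formatting
// differences do not hide a copy. (Naive: ignores // inside strings.)
function normalize(line: string): string {
  return line.replace(/\/\/.*$/, '').replace(/\s+/g, ' ').trim();
}

function findClones(files: SourceFile[], windowSize: number): Clone[] {
  const seen = new Map<string, string>(); // window text -> first location
  const clones: Clone[] = [];
  for (const file of files) {
    const kept = file.text
      .split('\n')
      .map((raw, index) => ({ text: normalize(raw), lineNo: index + 1 }))
      .filter((line) => line.text !== '');
    for (let i = 0; i + windowSize <= kept.length; i++) {
      const key = kept
        .slice(i, i + windowSize)
        .map((line) => line.text)
        .join('\n');
      const location = `${file.name}:${kept[i].lineNo}`;
      const earlier = seen.get(key);
      if (earlier !== undefined) {
        clones.push({ first: earlier, second: location });
      } else {
        seen.set(key, location);
      }
    }
  }
  return clones;
}

const sample: SourceFile[] = [
  {
    name: 'OrderService.ts',
    text: "if (tier === 'GOLD') return price * 0.9;\nif (tier === 'SILVER') return price * 0.95;\nreturn price;",
  },
  {
    name: 'InvoiceService.ts',
    text: "  if (tier === 'GOLD') return price * 0.9; // copied\n\n  if (tier === 'SILVER') return price * 0.95;\n  return price;",
  },
];
const found = findClones(sample, 3);
console.log(found.length); // 1
```

Real detectors compare tokens, which also lets them ignore renamed identifiers and literals. A line-based window like this one misses those cases.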
- Prototype code being actively iterated. You do not know the final abstraction yet. Do not extract until the shape is clear.
- Code scheduled for deletion. Refactoring a module that will be deleted next sprint is capital destruction.
- Two call sites evolving toward divergence. If you know they will diverge, do not couple them. Document the intended divergence instead.
The discipline: annotate the duplication. A comment that says “intentional duplicate of X because Y” is not a smell; it is a decision record. The smell is unacknowledged duplication.
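An annotated duplicate might look like this. The function names, file names, and the stated reason are all hypothetical; the point is only the shape of the comment.

```typescript
// OrderService.ts
function roundOrderTotal(amount: number): number {
  return Math.round(amount * 100) / 100;
}

// ReportService.ts
// INTENTIONAL DUPLICATE of roundOrderTotal in OrderService.ts.
// Reason (hypothetical): report figures are expected to switch to
// whole-unit rounding, while order totals must stay at two decimals.
function roundReportFigure(amount: number): number {
  return Math.round(amount * 100) / 100;
}
```

The comment names the other copy and the reason, so a reviewer or a later maintainer can tell the duplication was a decision and can check whether the reason still holds.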