Do code smells actually cause bugs? The research says yes - with important caveats
The empirical evidence is substantial, but it is not uniform. Some smells correlate with defect density at statistically significant levels across multiple independent studies spanning three decades; others do not. This page presents the data honestly, with the caveats the data requires.
| Smell | ρ (Spearman) | Strength | Notes |
|---|---|---|---|
| God Class | 0.38 | Strong | Strongest single-smell effect in Rahman 2025 |
| Feature Envy | 0.31 | Strong | Coupling mechanism; Basili 1996 confirms |
| Duplicate Code | 0.27 | Moderate | Shotgun Surgery and bug propagation |
| Long Method | 0.25 | Moderate | CC correlation strongest sub-effect |
| Shotgun Surgery | 0.24 | Moderate | Change-proneness primary mechanism |
| Inappropriate Intimacy | 0.22 | Moderate | Coupling; concurrent mutation risk |
| Parallel Inheritance | 0.20 | Moderate | Extension-point brittleness |
| Data Clumps | 0.14 | Weak | Validation fragmentation |
| Primitive Obsession | 0.13 | Weak | Type-safety absence; input validation gaps |
| Speculative Generality | 0.10 | Weak | Dead abstraction confusion |
| Data Class | 0.07 | Negligible | Scattered logic in callers, not in the class |
| Comments (apology) | 0.04 | Negligible | Signal, not cause; marks other smells |
Source: Rahman 2025 meta-analysis of 28 primary studies. ρ = Spearman rank correlation coefficient. Values of roughly 0.25 and above are considered moderate-to-strong in software engineering research contexts.
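For readers unfamiliar with the statistic: Spearman's ρ is just Pearson correlation computed on ranks, so it captures monotonic (not only linear) association. The sketch below computes it from scratch on invented per-class data (the smell and defect counts are illustrative, not taken from any study cited here):

```python
from statistics import mean

def ranks(xs):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-class data: smell-instance count vs defect count
smell_counts  = [0, 1, 1, 2, 3, 5, 8]
defect_counts = [1, 0, 2, 2, 4, 3, 9]
print(round(spearman(smell_counts, defect_counts), 2))  # → 0.88
```

A ρ of 0.38, as reported for God Class, is far weaker than this toy example - which is exactly why the limitation sections below matter.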
Bavota, De Lucia, Di Penta, Oliveto, and Palomba's “When and why your code starts to smell bad” (ICSE 2015) is the canonical follow-up study to Fowler. The paper analysed the git history of several large open-source Java systems and found:
- Smells are not introduced all at once. They emerge gradually through accretion.
- Smells are more likely to be introduced under commit messages indicating time pressure (“hotfix,” “quick fix,” “deadline”).
- Files that develop smells immediately become more change-prone and more fault-prone. The temporal coupling of smell introduction and defect increase is tight.
- Smells introduced by developers with lower commit frequency (less experienced or less familiar with the codebase) have higher associated defect rates than smells introduced by frequent committers.
The practical implication: the first sprint after a rushed release is the highest-risk window for smell accumulation. Post-deadline technical-debt booking is not just good practice; it is a defect-prevention investment.
Adam Tornhill's Your Code As A Crime Scene (Pragmatic Bookshelf, 2015) and Software Design X-Rays (2018) introduced the behavioural code analysis framework that became CodeScene. The core insight: static analysis finds smells in the code; behavioural analysis finds smells in the evolution of the code.
The three-signal hotspot model: a file that is (1) complex (high CC), (2) changed frequently (high commit frequency), and (3) fragmented in ownership (many distinct authors) predicts incidents better than any single signal. CodeScene's public research shows that the top 5% of files by this combined hotspot score account for approximately 60% of production defects in the systems studied.
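CodeScene's actual scoring formula is proprietary; the sketch below only illustrates the three-signal idea, using a simple product of max-normalised signals so that a file must rank high on all three to score high. The file names and numbers are invented:

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    complexity: int   # e.g. total cyclomatic complexity of the file
    commits: int      # commit count over the analysis window
    authors: int      # distinct contributors in that window

def hotspot_scores(files):
    """Illustrative three-signal hotspot score: each signal normalised to
    [0, 1] against the repo maximum, then multiplied. A file scores high
    only if it is complex AND frequently changed AND ownership-fragmented."""
    max_cc = max(f.complexity for f in files)
    max_commits = max(f.commits for f in files)
    max_authors = max(f.authors for f in files)
    return sorted(
        ((f.path, (f.complexity / max_cc)
                  * (f.commits / max_commits)
                  * (f.authors / max_authors)) for f in files),
        key=lambda pair: pair[1], reverse=True,
    )

repo = [
    FileStats("billing/invoice_manager.py", complexity=74, commits=120, authors=9),
    FileStats("util/strings.py",            complexity=12, commits=95,  authors=3),
    FileStats("core/legacy_parser.py",      complexity=81, commits=4,   authors=1),
]
for path, score in hotspot_scores(repo):
    print(f"{score:.2f}  {path}")
```

Note how `legacy_parser.py`, the most complex file, scores near zero: it is rarely touched, so its complexity is dormant risk rather than an active hotspot.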
The temporal coupling dimension is particularly valuable: two files that are always modified in the same commits are temporally coupled, even if no static analysis tool finds a direct dependency. Temporal coupling reveals implicit shotgun surgery that static analysis misses.
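Temporal coupling needs nothing more than the commit log. A minimal sketch, with an invented commit history: count how often each file pair co-occurs in a commit, and report the ratio of shared commits to the commit count of the less-changed file:

```python
from collections import Counter
from itertools import combinations

def temporal_coupling(commits, min_shared=2):
    """commits: iterable of per-commit file sets. Returns {(a, b): degree},
    where degree = shared commits / commits of the less-changed file."""
    pair_counts = Counter()
    file_counts = Counter()
    for files in commits:
        file_counts.update(set(files))
        for a, b in combinations(sorted(set(files)), 2):
            pair_counts[(a, b)] += 1
    return {
        pair: shared / min(file_counts[pair[0]], file_counts[pair[1]])
        for pair, shared in pair_counts.items()
        if shared >= min_shared  # filter out one-off co-changes
    }

# Hypothetical history: each entry is the set of files in one commit
history = [
    {"order.py", "order_test.py"},
    {"order.py", "order_test.py", "invoice.py"},
    {"order.py", "order_test.py"},
    {"invoice.py"},
]
coupling = temporal_coupling(history)
print(coupling[("order.py", "order_test.py")])  # → 1.0: always change together
```

A test file tracking its production file, as here, is healthy coupling; two unrelated modules with the same score is the implicit shotgun surgery the text describes.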
Christian Bird, Nachiappan Nagappan, Brendan Murphy, Harald Gall, and Premkumar Devanbu's “Don't Touch My Code! Examining the Effects of Ownership on Software Quality” (FSE 2011, Microsoft Research) is one of the most cited papers in software quality research.
Key finding: components with a single “strong owner” (contributor with >75% of commits) have significantly lower defect rates than components with fragmented ownership (many contributors each below 25%). The study controlled for component size, age, and change rate.
The implication for God Classes: a class that attracts contributions from authentication engineers, billing engineers, and notification engineers is structurally fragmented. The ownership fragmentation is a direct consequence of the class's accumulated responsibilities. The defect rate is a predictable outcome.
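The ownership signal is also computable straight from a commit log. A sketch using the Bird et al. thresholds as I read them (strong owner above 75% of commits, minor contributor below 5%); the author lists are invented:

```python
from collections import Counter

def ownership_profile(commit_authors, strong_threshold=0.75):
    """Classify a component's ownership from its per-commit author list.
    Thresholds follow the Bird et al. (FSE 2011) definitions as summarised
    above: strong owner = >75% of commits, minor contributor = <5%."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    top_author, top_commits = counts.most_common(1)[0]
    share = top_commits / total
    return {
        "top_author": top_author,
        "top_share": round(share, 2),
        "minor_contributors": sum(1 for c in counts.values() if c / total < 0.05),
        "strong_owner": share > strong_threshold,
    }

# Hypothetical commit logs for two components
focused    = ["alice"] * 40 + ["bob"] * 5
fragmented = ["alice", "bob", "carol", "dave"] * 5 + ["erin"] * 6

print(ownership_profile(focused)["strong_owner"])     # → True
print(ownership_profile(fragmented)["strong_owner"])  # → False
```

Run against a God Class's history, the fragmented pattern is the typical result: every team touches it, nobody owns it.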
Victor Basili, Lionel Briand, and Walcelio Melo's “A Validation of Object-Oriented Design Metrics as Quality Indicators” (IEEE Transactions on Software Engineering, 1996) was the first rigorous empirical validation of the Chidamber-Kemerer metrics suite.
The study collected defect data from eight medium-sized C++ information-management systems developed by student teams in a controlled study at the University of Maryland, and found that Coupling Between Objects (CBO) and Response For a Class (RFC) were the strongest predictors of fault-proneness. Weighted Methods per Class (WMC, a complexity proxy) was also predictive. Depth of Inheritance Tree (DIT) had mixed results.
Basili 1996 is the empirical backbone for Feature Envy and God Class cost estimates on this site. Both smells directly increase CBO - Feature Envy by creating outbound coupling, God Class by attracting inbound coupling.
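To make the CBO mechanism concrete: under the Chidamber-Kemerer definition, CBO counts the distinct classes a class is coupled to in either direction (it uses them, or they use it). A toy computation over a hypothetical dependency map; the class names are invented:

```python
def cbo(class_deps, cls):
    """CBO per Chidamber-Kemerer: distinct other classes coupled to `cls`,
    counting both outbound references and inbound references."""
    outbound = class_deps.get(cls, set())
    inbound = {c for c, deps in class_deps.items() if cls in deps and c != cls}
    return len((outbound | inbound) - {cls})

# Hypothetical dependency map: class -> classes it references
deps = {
    "OrderManager": {"Customer", "Invoice", "Mailer", "AuditLog", "TaxRules"},
    "Invoice": {"TaxRules"},
    "ReportJob": {"OrderManager"},
}
print(cbo(deps, "OrderManager"))  # → 6 (five outbound + ReportJob inbound)
print(cbo(deps, "Invoice"))       # → 2 (TaxRules out, OrderManager in)
```

This is why both smells move the same metric: Feature Envy adds entries to a class's outbound set, while a God Class swells its inbound set as ever more callers depend on it.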
Correlation is not causation
The research establishes correlation. A smell-dense class may be defect-prone because it is the hardest-worked part of the system, not because of the smell per se. Causal inference in software engineering research is genuinely hard.
Open-source selection bias
Most studies use open-source Java systems where defects are tracked in public issue trackers. These may not generalise to private enterprise codebases with different defect recording practices, different development cultures, and different levels of testing investment.
Tool detection vs human judgement gap
A smell detected by SonarCloud and a smell as judged by an experienced engineer are different things. Studies that rely on automated smell detection may measure the tool's false-positive rate as much as the smell's actual prevalence.
Publication bias
Studies that find no significant correlation between smells and defects are less likely to be published than studies that find a significant correlation. The literature may overstate effect sizes.
Prioritise active remediation for the smells at the top of the table (ρ ≥ 0.24): God Class, Feature Envy, Duplicate Code, Long Method, Shotgun Surgery. These are the smells with the strongest and most replicated defect correlations. For these smells, the business case for refactoring is empirically supported.
Accept that low-correlation smells (Comments, Data Class) may be stylistic preferences with limited defect impact. Do not spend engineering capital removing Comments smells when God Classes remain unaddressed.