Do code smells actually cause bugs? The research says yes - with important caveats
The empirical evidence is substantial, but it is not uniform. Some smells correlate with defect density at statistically significant levels across multiple independent studies spanning three decades; others do not. This page presents the data honestly, with the caveats the data requires.
| Smell | ρ (Spearman) | Strength | Notes |
|---|---|---|---|
| God Class | 0.38 | Strong | Strongest single-smell effect in Rahman 2025 |
| Feature Envy | 0.31 | Strong | Coupling mechanism; Basili 1996 confirms |
| Duplicate Code | 0.27 | Moderate | Shotgun Surgery and bug propagation |
| Long Method | 0.25 | Moderate | CC correlation strongest sub-effect |
| Shotgun Surgery | 0.24 | Moderate | Change-proneness primary mechanism |
| Inappropriate Intimacy | 0.22 | Moderate | Coupling; concurrent mutation risk |
| Parallel Inheritance | 0.20 | Moderate | Extension-point brittleness |
| Data Clumps | 0.14 | Weak | Validation fragmentation |
| Primitive Obsession | 0.13 | Weak | Type-safety absence; input validation gaps |
| Speculative Generality | 0.10 | Weak | Dead abstraction confusion |
| Data Class | 0.07 | Negligible | Scattered logic in callers, not in the class |
| Comments (apology) | 0.04 | Negligible | Signal, not cause; marks other smells |
Source: Rahman 2025 meta-analysis of 28 primary studies. ρ = Spearman rank correlation coefficient. Values of roughly 0.25 and above are considered moderate-to-strong in software engineering research contexts.
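For readers unfamiliar with the statistic: Spearman's ρ is just Pearson correlation computed on ranks, so it captures monotonic (not only linear) association. The sketch below computes it from scratch on invented per-class data (the smell and defect counts are illustrative, not taken from any study cited here):

```python
from statistics import mean

def ranks(xs):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-class data: smell-instance count vs defect count
smell_counts  = [0, 1, 1, 2, 3, 5, 8]
defect_counts = [1, 0, 2, 2, 4, 3, 9]
print(round(spearman(smell_counts, defect_counts), 2))  # → 0.88
```

A ρ of 0.38, as reported for God Class, is far weaker than this toy example - which is exactly why the limitation sections below matter.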
Bavota, De Lucia, Di Penta, Oliveto, and Palomba's “When and why your code starts to smell bad” (ICSE 2015) is the canonical follow-up study to Fowler. The paper analysed the git history of several large open-source Java systems and found:
- Smells are not introduced all at once. They emerge gradually through accretion.
- Smells are more likely to be introduced under commit messages indicating time pressure (“hotfix,” “quick fix,” “deadline”).
- Files that develop smells immediately become more change-prone and more fault-prone. The temporal coupling of smell introduction and defect increase is tight.
- Smells introduced by developers with lower commit frequency (less experienced or less familiar with the codebase) have higher associated defect rates than smells introduced by frequent committers.
The practical implication: the first sprint after a rushed release is the highest-risk window for smell accumulation. Post-deadline technical-debt booking is not just good practice; it is a defect-prevention investment.
Adam Tornhill's Your Code As A Crime Scene (Pragmatic Bookshelf, 2015) and Software Design X-Rays (2018) introduced the behavioural code analysis framework that became CodeScene. The core insight: static analysis finds smells in the code; behavioural analysis finds smells in the evolution of the code.
The three-signal hotspot model: a file that is (1) complex (high CC), (2) changed frequently (high commit frequency), and (3) fragmented in ownership (many distinct authors) predicts incidents better than any single signal. CodeScene's public research shows that the top 5% of files by this combined hotspot score account for approximately 60% of production defects in the systems studied.
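CodeScene's actual scoring formula is proprietary; the sketch below only illustrates the three-signal idea, using a simple product of max-normalised signals so that a file must rank high on all three to score high. The file names and numbers are invented:

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    complexity: int   # e.g. total cyclomatic complexity of the file
    commits: int      # commit count over the analysis window
    authors: int      # distinct contributors in that window

def hotspot_scores(files):
    """Illustrative three-signal hotspot score: each signal normalised to
    [0, 1] against the repo maximum, then multiplied. A file scores high
    only if it is complex AND frequently changed AND ownership-fragmented."""
    max_cc = max(f.complexity for f in files)
    max_commits = max(f.commits for f in files)
    max_authors = max(f.authors for f in files)
    return sorted(
        ((f.path, (f.complexity / max_cc)
                  * (f.commits / max_commits)
                  * (f.authors / max_authors)) for f in files),
        key=lambda pair: pair[1], reverse=True,
    )

repo = [
    FileStats("billing/invoice_manager.py", complexity=74, commits=120, authors=9),
    FileStats("util/strings.py",            complexity=12, commits=95,  authors=3),
    FileStats("core/legacy_parser.py",      complexity=81, commits=4,   authors=1),
]
for path, score in hotspot_scores(repo):
    print(f"{score:.2f}  {path}")
```

Note how `legacy_parser.py`, the most complex file, scores near zero: it is rarely touched, so its complexity is dormant risk rather than an active hotspot.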
The temporal coupling dimension is particularly valuable: two files that are always modified in the same commits are temporally coupled, even if no static analysis tool finds a direct dependency. Temporal coupling reveals implicit shotgun surgery that static analysis misses.
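Temporal coupling needs nothing more than the commit log. A minimal sketch, with an invented commit history: count how often each file pair co-occurs in a commit, and report the ratio of shared commits to the commit count of the less-changed file:

```python
from collections import Counter
from itertools import combinations

def temporal_coupling(commits, min_shared=2):
    """commits: iterable of per-commit file sets. Returns {(a, b): degree},
    where degree = shared commits / commits of the less-changed file."""
    pair_counts = Counter()
    file_counts = Counter()
    for files in commits:
        file_counts.update(set(files))
        for a, b in combinations(sorted(set(files)), 2):
            pair_counts[(a, b)] += 1
    return {
        pair: shared / min(file_counts[pair[0]], file_counts[pair[1]])
        for pair, shared in pair_counts.items()
        if shared >= min_shared  # filter out one-off co-changes
    }

# Hypothetical history: each entry is the set of files in one commit
history = [
    {"order.py", "order_test.py"},
    {"order.py", "order_test.py", "invoice.py"},
    {"order.py", "order_test.py"},
    {"invoice.py"},
]
coupling = temporal_coupling(history)
print(coupling[("order.py", "order_test.py")])  # → 1.0: always change together
```

A test file tracking its production file, as here, is healthy coupling; two unrelated modules with the same score is the implicit shotgun surgery the text describes.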
Christian Bird, Nachiappan Nagappan, Brendan Murphy, Harald Gall, and Premkumar Devanbu's “Don't Touch My Code! Examining the Effects of Ownership on Software Quality” (FSE 2011, Microsoft Research) is one of the most cited papers in software quality research.
Key finding: components with a single “strong owner” (contributor with >75% of commits) have significantly lower defect rates than components with fragmented ownership (many contributors each below 25%). The study controlled for component size, age, and change rate.
The implication for God Classes: a class that attracts contributions from authentication engineers, billing engineers, and notification engineers is structurally fragmented. The ownership fragmentation is a direct consequence of the class's accumulated responsibilities. The defect rate is a predictable outcome.
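The ownership signal is also computable straight from a commit log. A sketch using the Bird et al. thresholds as I read them (strong owner above 75% of commits, minor contributor below 5%); the author lists are invented:

```python
from collections import Counter

def ownership_profile(commit_authors, strong_threshold=0.75):
    """Classify a component's ownership from its per-commit author list.
    Thresholds follow the Bird et al. (FSE 2011) definitions as summarised
    above: strong owner = >75% of commits, minor contributor = <5%."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    top_author, top_commits = counts.most_common(1)[0]
    share = top_commits / total
    return {
        "top_author": top_author,
        "top_share": round(share, 2),
        "minor_contributors": sum(1 for c in counts.values() if c / total < 0.05),
        "strong_owner": share > strong_threshold,
    }

# Hypothetical commit logs for two components
focused    = ["alice"] * 40 + ["bob"] * 5
fragmented = ["alice", "bob", "carol", "dave"] * 5 + ["erin"] * 6

print(ownership_profile(focused)["strong_owner"])     # → True
print(ownership_profile(fragmented)["strong_owner"])  # → False
```

Run against a God Class's history, the fragmented pattern is the typical result: every team touches it, nobody owns it.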
Victor Basili, Lionel Briand, and Walcelio Melo's “A Validation of Object-Oriented Design Metrics as Quality Indicators” (IEEE Transactions on Software Engineering, 1996) was the first rigorous empirical validation of the Chidamber-Kemerer metrics suite.
The study collected defect data from eight medium-sized C++ information-management systems developed by student teams in a controlled study at the University of Maryland, and found that Coupling Between Objects (CBO) and Response For a Class (RFC) were the strongest predictors of fault-proneness. Weighted Methods per Class (WMC, a complexity proxy) was also predictive. Depth of Inheritance Tree (DIT) had mixed results.
Basili 1996 is the empirical backbone for Feature Envy and God Class cost estimates on this site. Both smells directly increase CBO - Feature Envy by creating outbound coupling, God Class by attracting inbound coupling.
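To make the CBO mechanism concrete: under the Chidamber-Kemerer definition, CBO counts the distinct classes a class is coupled to in either direction (it uses them, or they use it). A toy computation over a hypothetical dependency map; the class names are invented:

```python
def cbo(class_deps, cls):
    """CBO per Chidamber-Kemerer: distinct other classes coupled to `cls`,
    counting both outbound references and inbound references."""
    outbound = class_deps.get(cls, set())
    inbound = {c for c, deps in class_deps.items() if cls in deps and c != cls}
    return len((outbound | inbound) - {cls})

# Hypothetical dependency map: class -> classes it references
deps = {
    "OrderManager": {"Customer", "Invoice", "Mailer", "AuditLog", "TaxRules"},
    "Invoice": {"TaxRules"},
    "ReportJob": {"OrderManager"},
}
print(cbo(deps, "OrderManager"))  # → 6 (five outbound + ReportJob inbound)
print(cbo(deps, "Invoice"))       # → 2 (TaxRules out, OrderManager in)
```

This is why both smells move the same metric: Feature Envy adds entries to a class's outbound set, while a God Class swells its inbound set as ever more callers depend on it.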
Correlation is not causation
The research establishes correlation. A smell-dense class may be defect-prone because it is the hardest-worked part of the system, not because of the smell per se. Causal inference in software engineering research is genuinely hard.
Open-source selection bias
Most studies use open-source Java systems where defects are tracked in public issue trackers. These may not generalise to private enterprise codebases with different defect recording practices, different development cultures, and different levels of testing investment.
Tool detection vs human judgement gap
A smell detected by SonarCloud and a smell as judged by an experienced engineer are different things. Studies that rely on automated smell detection may measure the tool's false-positive rate as much as the smell's actual prevalence.
Publication bias
Studies that find no significant correlation between smells and defects are less likely to be published than studies that find a significant correlation. The literature may overstate effect sizes.
Prioritise active remediation for the smells at the top of the table (ρ ≥ 0.24): God Class, Feature Envy, Duplicate Code, Long Method, Shotgun Surgery. These are the smells with the strongest and most replicated defect correlations. For these smells, the business case for refactoring is empirically supported.
Accept that low-correlation smells (Comments, Data Class) may be stylistic preferences with limited defect impact. Do not spend engineering capital removing Comments smells when God Classes remain unaddressed.