Why your PRs take so long to review: the research on cognitive load, complexity, and review economics
The answer is not “my reviewer is busy.” Human cognitive capacity is the bottleneck, and review volume has outgrown it. The research, from Halstead (1977) through Campbell (2018) to DORA (2024), draws a consistent line from code complexity to review latency to deployment frequency to engineering output.
Cyclomatic Complexity (McCabe 1976) counts decision paths: every if, else if, case, for, while, and logical operator adds one. It was designed to measure the minimum number of test cases required for full path coverage. For that purpose, it is still useful.
But cyclomatic complexity is a poor predictor of human review time. A flat switch statement with 50 cases has a cyclomatic complexity of 50 but is trivially readable: each case is independent. A 15-line method with three nested if/else blocks has a cyclomatic complexity of 8 but requires the reviewer to track three simultaneous conditions across nested scopes.
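The contrast can be made concrete in a short sketch. Both functions below are hypothetical (Python has no `switch`, so a flat `if` chain stands in for one); the point is the shape, not the domain:

```python
# Flat dispatch: cyclomatic complexity grows with each branch,
# but every branch is independent -- no state carries between them.
def http_status_text(code: int) -> str:
    if code == 200: return "OK"
    if code == 201: return "Created"
    if code == 301: return "Moved Permanently"
    if code == 404: return "Not Found"
    # ...one independent branch per status code
    return "Unknown"

# Nested conditions: fewer branches, but the reader must hold every
# enclosing condition in mind to understand the innermost line.
def discount(user, order) -> float:
    if user.is_member:
        if order.total > 100:
            if user.years_active > 2:
                return 0.15  # reached only when ALL THREE conditions hold
            return 0.10
        return 0.05
    return 0.0
```

The first function scales to fifty branches without getting harder to review; the second gets harder with every added level.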
G. Ann Campbell's Cognitive Complexity metric (SonarSource whitepaper 2018) addresses this. Three rules:
- Each structural element that breaks linear flow (if, else, loops, switches) increments complexity by 1.
- Each nesting level of a structural element adds a further increment equal to the nesting depth.
- Certain constructs (method calls, boolean operators in conditions) are incremented or not incremented on the basis of their cognitive demand, not their control-flow contribution.
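The first two rules are easy to mechanize. Below is a deliberately simplified scorer (my own sketch, not SonarSource's implementation: it ignores `elif`/`else`, `try`/`except`, recursion, and the boolean-sequence subtleties that Campbell's full specification handles), just to show how the nesting penalty compounds:

```python
import ast

class CognitiveComplexity(ast.NodeVisitor):
    """Rough approximation of Campbell's metric: +1 per flow-breaking
    structure, plus an extra increment equal to the nesting depth."""

    def __init__(self):
        self.score = 0
        self.depth = 0

    def _structure(self, node):
        self.score += 1 + self.depth  # +1 for the structure, + nesting
        self.depth += 1
        self.generic_visit(node)
        self.depth -= 1

    visit_If = visit_For = visit_While = _structure

    def visit_BoolOp(self, node):
        self.score += 1               # boolean sequence: +1, no nesting penalty
        self.generic_visit(node)

def cognitive_complexity(source: str) -> int:
    visitor = CognitiveComplexity()
    visitor.visit(ast.parse(source))
    return visitor.score
```

Three sequential `if`s score 1+1+1 = 3; the same three `if`s nested score 1+2+3 = 6. Same branch count, double the cognitive cost.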
SonarSource's default threshold for a human review gate is 15 CC units per method. Methods above 25 are automatically flagged for refactoring. In practice, the threshold that correlates with reviewer saturation and defect escape is closer to 20-25.
```
// Cognitive Complexity vs Cyclomatic Complexity comparison
//
// Method A: 40 lines, CC=28, cyclomatic=8
//   Nested: 3 levels deep, 4 branches at the bottom level
//   A reviewer must hold the full state tree in memory
//
// Method B: 120 lines, CC=9, cyclomatic=50
//   Flat switch: 50 independent cases
//   A reviewer reads top to bottom, no state accumulation
//
// Verdict: Method B takes less review time despite 3x the lines
// Cyclomatic predicts test cases; CC predicts human effort
```

Maurice Halstead's 1977 Elements of Software Science (Elsevier) introduced a set of software metrics derived from counts of operators and operands in a program. The key derived metrics: vocabulary (unique operators + unique operands), volume (total operators and operands × log2 vocabulary), difficulty (operator diversity × operand density), and effort (volume × difficulty).
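The derived metrics follow mechanically from four counts. A minimal worked implementation (the counts themselves are illustrative inputs; extracting them from real source is the hard part):

```python
import math

def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Halstead metrics from operator/operand counts:
    n1/n2 = unique operators/operands, N1/N2 = total occurrences."""
    vocabulary = n1 + n2                      # distinct tokens
    length = N1 + N2                          # total tokens
    volume = length * math.log2(vocabulary)   # information content, in bits
    difficulty = (n1 / 2) * (N2 / n2)         # operator diversity x operand reuse
    effort = volume * difficulty              # estimated mental effort
    return {"vocabulary": vocabulary, "length": length,
            "volume": volume, "difficulty": difficulty, "effort": effort}
```

For a toy program with 4 unique operators, 3 unique operands, and 10 + 8 total occurrences, vocabulary is 7, length is 18, and volume is 18 × log2(7) ≈ 50.5.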
Halstead's metrics are historically interesting and empirically mixed. Some studies have found Halstead volume correlated with defect density; others have found it redundant with cyclomatic complexity. The most durable Halstead contribution is the maintainability index, a composite metric (partially Halstead-derived) that SonarCloud and Visual Studio still expose.
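For reference, the classic three-factor maintainability index combines Halstead volume with cyclomatic complexity and line count; the sketch below uses the widely cited formula, rescaled to 0-100 the way Visual Studio reports it (tool implementations vary in the exact constants):

```python
import math

def maintainability_index(halstead_volume: float,
                          cyclomatic: int, loc: int) -> float:
    """Classic maintainability index, rescaled to the 0-100 range.
    Higher is more maintainable; each factor drags the score down."""
    raw = (171
           - 5.2 * math.log(halstead_volume)
           - 0.23 * cyclomatic
           - 16.2 * math.log(loc))
    return max(0.0, raw * 100 / 171)
```

Note the logarithms: doubling volume or line count costs far less than doubling cyclomatic complexity per unit, which is one reason the composite is hard to interpret on its own.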
The honest practitioner position: cite Halstead when your tool reports it, understand what it measures, and do not over-weight it relative to Cognitive Complexity for review-time predictions.
Nicole Forsgren, Jez Humble, and Gene Kim's Accelerate (IT Revolution, 2018) is the empirical foundation for engineering performance measurement. The DORA four key metrics - deployment frequency, lead time for changes, change failure rate, and time to restore service - are now standard performance indicators across the industry.
DORA's 2024 State of DevOps report classifies teams into elite, high, medium, and low performance based on the four metrics. The review latency data:
| Performance tier | Review latency | Deployment frequency |
|---|---|---|
| Elite | < 1 hour | On demand (multiple/day) |
| High | 1 hour to 1 day | Daily to weekly |
| Medium | 1 day to 1 week | Weekly to monthly |
| Low | 1 week to 1 month | Monthly to 6-monthly |
Review latency is a leading indicator of deployment frequency. Teams that cannot merge in under 24 hours cannot deploy daily. Teams that cannot deploy daily cannot respond to customer feedback on the timescales that product-led growth requires. Code complexity is not the only driver of review latency, but it is the most tractable one.
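A back-of-envelope model (my own framing, not from the DORA report) makes the dependency concrete: when a feature is a chain of dependent PRs, each link waits at least one full review cycle, so review latency lower-bounds lead time:

```python
def min_lead_time_hours(review_latency_hours: float,
                        sequential_prs: int) -> float:
    """Lower bound on lead time for a feature shipped as a chain of
    dependent PRs: each PR waits at least one full review cycle.
    Ignores implementation time, CI time, and rework, so the real
    number is strictly worse."""
    return review_latency_hours * sequential_prs
```

A three-PR feature with 24-hour review latency cannot ship in under three days, no matter how fast the code is written; at one-hour latency the same chain clears in an afternoon.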
Smaller PRs (median 200 LOC or less)
The Cisco/SmartBear study (2007) found that reviewer effectiveness drops sharply above 400 LOC and above 60 minutes of review. Google's internal data and DORA both confirm that PRs under 200 LOC have significantly lower latency and higher defect-detection rates than larger PRs. The practice: decompose feature work into shippable increments, not feature-complete bundles.
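A lightweight way to enforce the size budget is to check the diff before opening the PR. The helper below is hypothetical; it parses the output of `git diff --shortstat` (whose format git does produce) rather than shelling out, so the threshold logic stays testable:

```python
import re

def diff_loc(shortstat: str) -> int:
    """Total changed lines from a `git diff --shortstat` summary, e.g.
    ' 3 files changed, 120 insertions(+), 45 deletions(-)'."""
    ins = re.search(r"(\d+) insertion", shortstat)
    dels = re.search(r"(\d+) deletion", shortstat)
    return (int(ins.group(1)) if ins else 0) + \
           (int(dels.group(1)) if dels else 0)

def review_size_warning(shortstat: str, limit: int = 200) -> bool:
    """True when the diff exceeds the target review size."""
    return diff_loc(shortstat) > limit
```

Wire it into a pre-push hook or CI step and the 200-LOC norm becomes a nudge instead of a reviewer comment.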
One responsibility per PR
PRs that mix refactoring with feature addition force reviewers to evaluate two different mental models simultaneously. A refactoring PR that modifies no logic is easy to review. A feature PR that modifies no structure is easy to review. A PR that does both requires the reviewer to confirm that the refactoring does not affect the feature and vice versa. Separate them.
Pre-submit automation removes machine-detectable issues
Every comment in a review that asks 'this exceeds the complexity threshold' or 'this variable is unused' is a comment that a linter could have caught before review opened. Move machine-detectable issues out of the human reviewer's queue and into a CI gate. The human reviewer's time is for design decisions that machines cannot make.
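One common way to wire this up is a pre-commit configuration; the fragment below is an illustrative sketch (the `ruff-pre-commit` and `pre-commit-hooks` repositories are real, but the `rev` pins are placeholders to update for your setup):

```yaml
# .pre-commit-config.yaml -- illustrative; pin revs to current releases
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.0          # placeholder version
    hooks:
      - id: ruff         # unused variables, style, simple bug patterns
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0          # placeholder version
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
```

Anything these hooks catch never reaches the human reviewer's queue.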
- Fixed review-time windows (09:00 and 15:00 daily) instead of always-on interrupts. Always-on review expectations fragment focus time. Batched windows preserve flow while maintaining low latency.
- Round-robin reviewer assignment with a fallback rotation. Self-assignment often concentrates review burden on senior engineers with the highest opportunity cost per review hour.
- Draft-then-full two-phase PR convention. A draft PR signals “feedback on design welcome, not yet ready for line-level review.” It prevents design-level review comments arriving after the author has already invested in implementation-level detail.
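The round-robin rotation is simple enough to sketch; the helper and names below are hypothetical, but they show the two properties that matter, skipping the author and falling back when the primary list is exhausted:

```python
from itertools import cycle

def make_reviewer_rotation(reviewers, fallbacks):
    """Round-robin reviewer assignment with a fallback rotation.
    Skips the PR author and anyone marked unavailable."""
    primary = cycle(reviewers)
    backup = cycle(fallbacks)

    def assign(author: str, unavailable: set = frozenset()) -> str:
        # One full pass over the primary list; then fall back.
        for _ in range(len(reviewers)):
            candidate = next(primary)
            if candidate != author and candidate not in unavailable:
                return candidate
        return next(backup)

    return assign
```

Because the cycle persists between calls, assignments spread evenly instead of concentrating on whoever self-assigns fastest.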
The exceptions: early-stage product teams, where merge-first-and-fix-later is the right tradeoff. A two-person startup does not need round-robin reviewer assignment. A prototype codebase does not need a cognitive-complexity gate. A one-sprint throwaway experiment does not need a methodology review.
The review-time investment compounds with team size and codebase age. Teams of two can ignore it; teams of twenty cannot afford to. Read “when smells are OK” for the full counterpoint.