Appendix B: Stage-by-Stage Metrics Guide

This appendix is reference material for Chapter 9. It maps what to measure, what to ignore, and what tools to use at each stage of the CTO role described in Chapter 1.

The governing principle is Lena Reinhard’s: "Metrics create incentives, that create behaviours, that create culture."[1] Every metric you introduce will change how the team works — sometimes in ways you intended, sometimes not. The second principle is Will Larson’s: "The number one measurement risk is measuring nothing because you’re trying to measure everything."[2] At every stage, the CTO should measure the minimum that produces actionable insight and resist the pressure to instrument more.

In 2025, DORA abandoned the four-tier performance model (elite, high, medium, low) that practitioners had been citing as benchmarks for six years, replacing it with seven behavioural archetypes.[3] The shift confirmed what Chapter 9 argues: DORA clusters were never targets. They were diagnostic patterns that changed every year. The benchmarks in this appendix are reference points for calibration, not goals to optimise toward.

Stage 1: Coder — Seed, 1–5 Engineers

What to measure. Three things, as Chapter 9 recommends: deployment frequency, recovery speed, and team morale. Deployment frequency is what Charity Majors calls "the heartbeat of your company. It should be regular. It should happen consistently. It should be boring."[4] At this stage, aim to deploy at least daily — LinearB’s benchmark data shows startups deploy 18% faster than enterprises.[5] Recovery speed means: when something breaks in production, how long does it take to fix? Track this informally. Team morale means: in your weekly one-to-ones, are people energised or frustrated? Reinhard’s guidance for this stage: "Everyone works more or less on everything and the team size is small enough so that everyone still has visibility into what’s going on. Use a rudimentary project management tool. At this stage, anything more would probably add too much overhead and slow teams down."[1]
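At this stage the deployment-frequency heartbeat needs nothing more than a count over your deploy history. A minimal sketch in Python, assuming deploy dates are pulled from git tags or a CI log (the source, the dates, and the five-per-working-week threshold below are all illustrative assumptions, not a prescribed setup):

```python
from collections import Counter
from datetime import date

# Hypothetical deploy dates, e.g. read from `git tag` dates or a CI
# webhook log. These specific values are illustrative only.
deploys = [
    date(2025, 3, 3), date(2025, 3, 3), date(2025, 3, 4),
    date(2025, 3, 5), date(2025, 3, 7), date(2025, 3, 10),
]

# Group by ISO (year, week) and count. A regular, boring heartbeat
# shows up as similar counts week over week.
per_week = Counter(d.isocalendar()[:2] for d in deploys)
for (year, week), n in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {n} deploys")

# "At least daily" at this stage means roughly 5+ per working week.
weeks_at_daily = sum(1 for n in per_week.values() if n >= 5)
```

A spreadsheet works just as well; the point is that the number exists and is looked at weekly, not that it is instrumented.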

What not to measure. Anything at the individual level. No lines of code, no pull requests per person, no story points, no velocity. Martin Fowler’s argument remains definitive: "Any true measure of software development productivity must be based on delivered business value. False measures only make things worse."[6] Larson is blunter about the tools that promise individual productivity measurement: "I’m personally convinced that these companies are selling products that harm the companies that use them rather than help them."[2]

Goal-setting. Informal. Dave Bailey’s "one metric that matters" principle applies: identify the single number that tells you whether the product is working, and orient the team around it.[7] If you introduce OKRs at this stage, you are spending process time that should be spent shipping.

Tools. GitHub or GitLab built-in analytics. No dedicated metrics platform. Cost: free.

Board reporting. At seed, you are unlikely to have a board that demands engineering metrics. If asked, report two numbers: how often you ship and whether production is stable. Frame engineering investment as a percentage of total spend — at seed, this is typically 40–95% of revenue, which is expected.[8]

Healthcare addendum. Track patching cadence (time from vulnerability disclosure to patch applied). Begin SOC 2 readiness documentation if you are pursuing enterprise healthcare customers. Audit log retention should be configured from day one — HIPAA requires six years minimum (Chapter 4).

Stage 2: Manager — Series A, 5–15 Engineers

What to measure. Everything from Stage 1, plus three additions. First, a developer experience signal — a survey run every eight weeks or, at most, quarterly. Laura Tacho, CTO of DX, recommends this cadence: "Every 8 weeks is a good cadence. Running the survey twice a year or less doesn’t provide enough data."[9] At this team size, keep surveys short — five to seven questions on a Likert scale covering satisfaction, friction points, and confidence in the codebase. Tacho warns about anonymity at small scale: "Even basic info like job title can blow cover in small teams."[9] Second, track the ratio of new-capability work to maintenance work. If more than 40% of engineering capacity is going to maintenance, your technical debt load is affecting your ability to compete — the threshold Chelsea Troy identifies in Chapter 5. Third, begin tracking cycle time: the elapsed time from first commit to production deployment. LinearB’s benchmark for "good" cycle time is under 72 hours at the 75th percentile.[5]
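Cycle time reduces to two timestamps per change. A minimal sketch, assuming hypothetical (first commit, production deploy) pairs; the standard library's `quantiles` gives the 75th percentile to set against LinearB's under-72-hours benchmark:

```python
from datetime import datetime, timedelta
from statistics import quantiles

# Hypothetical per-change records: (first commit, production deploy).
# In practice these come from your VCS and deploy pipeline; the exact
# source and values here are illustrative assumptions.
changes = [
    (datetime(2025, 3, 3, 9), datetime(2025, 3, 4, 15)),   # 30 h
    (datetime(2025, 3, 4, 10), datetime(2025, 3, 5, 10)),  # 24 h
    (datetime(2025, 3, 5, 8), datetime(2025, 3, 8, 8)),    # 72 h
    (datetime(2025, 3, 6, 9), datetime(2025, 3, 11, 9)),   # 120 h
]

# Elapsed hours per change.
hours = [(done - start) / timedelta(hours=1) for start, done in changes]

# P75: the value below which 75% of cycle times fall. Quartile cut
# points are the n=4 quantiles; the last one is P75.
p75 = quantiles(hours, n=4)[-1]
print(f"P75 cycle time: {p75:.0f} h (benchmark: <72 h)")
```

Using a percentile rather than the mean matters: one stuck pull request should show up in the number rather than be averaged away.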

What not to measure. Velocity as a cross-team comparison tool. Story points are a planning aid for individual teams, not a productivity metric. Do not rank engineers against each other on any quantitative dimension. Gergely Orosz warns: "Every new thing you start measuring will lead to engineers optimising to make that measure look better. A problem with measurements focused on effort and output is that they transform the engineering culture to one where 'slack time' is frowned upon."[10]

Goal-setting. Bailey’s lightweight OKR adaptation: four-to-six-week cycles, one or two key results per objective, scoped to customers and team.[7] Alternatively, Spotify’s DIBB framework (Data, Insight, Belief, Bet) if the company is product-led.[11] Marty Cagan’s diagnostic applies: "If a company is still using feature teams, then the OKR technique is going to be a cultural mismatch."[12]

Tools. Swarmia Lite (~$240/developer/year) or LinearB free tier for delivery metrics. A simple developer experience survey — DX’s platform, or a five-question homebrew form. Cost: $0–$3,600/year.

Board reporting. Report three categories, following Tomasz Tunguz’s recommendation: reliability (percentage of requests served successfully), availability (percentage served within latency targets), and a quality proxy such as support-ticket-to-active-user ratio.[13] Camille Fournier’s principle: never report Agile internals to the board. "The engineering department is not a standalone function, and it’s certainly not a factory cranking out code widgets."[14] Frame engineering capacity in business terms: "We are spending X% of engineering time on new capabilities, Y% on maintenance, and Z% on debt reduction."

Healthcare addendum. Add mean time to detect (MTTD) and mean time to contain (MTTC) for security incidents. Begin formal access-control reviews. If not already in progress, start the SOC 2 Type 2 audit cycle (Chapter 4). Consider compliance automation — Vanta or Drata — which maps SOC 2 controls to HIPAA requirements and reduces manual audit burden.
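MTTD and MTTC fall out of three timestamps per incident. A sketch under assumed field names (`began`, `detected`, and `contained` are illustrative, not a prescribed schema):

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident log: when the issue began, when it was
# detected, and when it was contained. Values are illustrative.
incidents = [
    {"began": datetime(2025, 2, 1, 3, 0),
     "detected": datetime(2025, 2, 1, 4, 30),
     "contained": datetime(2025, 2, 1, 9, 0)},
    {"began": datetime(2025, 2, 14, 12, 0),
     "detected": datetime(2025, 2, 14, 12, 30),
     "contained": datetime(2025, 2, 14, 20, 0)},
]

def to_hours(td):
    return td / timedelta(hours=1)

# MTTD: began -> detected. MTTC: detected -> contained.
mttd = mean(to_hours(i["detected"] - i["began"]) for i in incidents)
mttc = mean(to_hours(i["contained"] - i["detected"]) for i in incidents)
print(f"MTTD: {mttd:.1f} h, MTTC: {mttc:.1f} h")
```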

Stage 3: Director — Series B, 15–50 Engineers

What to measure. Everything from Stage 2, plus four additions. First, the full set of DORA metrics — deployment frequency, lead time for changes, change failure rate, and time to recover — now tracked from production systems rather than estimated in conversation. Use these as a diagnostic, not a scorecard: identify the most constrained metric and address that constraint (Chapter 9). Second, investment allocation: what percentage of engineering time goes to new capabilities, maintenance, and debt reduction? DX Core 4 benchmarks show that top-quartile tech companies spend roughly 6% more of their time on new capabilities than non-tech companies.[15] Swarmia recommends a Balance Framework target: 60% new capabilities and improvements, 15% developer productivity, 10% keeping-the-lights-on.[16] Third, cross-team delivery metrics: are handoffs between teams creating bottlenecks? Fourth, SLA compliance for customer-facing systems — 99.9% uptime (three nines, roughly 8.76 hours of downtime per year) is baseline for most SaaS; 99.99% (four nines, roughly 52 minutes per year) is the target for critical paths.
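The "nines" arithmetic is worth having to hand when negotiating SLAs: the annual downtime budget is simply (1 − availability) × hours in a year. A quick check of the figures quoted above, assuming a 365-day year:

```python
# Downtime budget implied by an availability target, over a 365-day year.
HOURS_PER_YEAR = 24 * 365  # 8,760

def downtime_budget_hours(availability: float) -> float:
    """Hours of allowed downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR

three_nines = downtime_budget_hours(0.999)   # roughly 8.76 hours/year
four_nines = downtime_budget_hours(0.9999)   # roughly 52.6 minutes/year
print(f"99.9%: {three_nines:.2f} h/yr; 99.99%: {four_nines * 60:.0f} min/yr")
```

Each extra nine divides the budget by ten, which is why the engineering cost of moving from three nines to four is far more than incremental.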

What not to measure. Do not treat the DORA tier labels as fixed targets — they shifted every year even before DORA abandoned them. Do not let board-level metrics cascade into team-level pressure. Kent Beck’s warning: "Trying to meet that need through metrics or even surveys inevitably poisons the data on which productivity measurement relies."[17] Hillel Wayne extends the point: "Even 100% honest pursuit of a metric, taken far enough, is harmful to your goals, and this is an inescapable consequence of the difference between metrics and values."[18]

Goal-setting. OKRs may begin to earn their keep at this stage, provided the engineering team is empowered to determine how to solve the problems they are assigned rather than being told what to build (Cagan’s distinction from Chapter 9). If the team is still operating as a feature factory, OKRs are a bureaucratic layer on a delivery function.

Tools. LinearB Business (~$49/month per contributor), Swarmia full platform, or DX for developer experience measurement. Compliance automation (Vanta, Drata, or Secureframe) becomes essential at this stage for healthcare companies. Cost: $9,000–$30,000/year.

Board reporting. Tunguz’s three metrics (reliability, availability, incident rate) remain the core.[13] Add investment allocation as a pie chart — the board needs to see where engineering time goes and whether the ratio is shifting in the right direction. Add a hiring and retention section: open roles, time to fill, regrettable attrition. Translate velocity into business language: not "we shipped 12 features" but "we reduced customer onboarding time from three weeks to four days." R&D as a percentage of revenue should be in the 25–35% range at this stage.[8]

Healthcare addendum. SOC 2 Type 2 audit on a 6–12-month cycle. Incident response time benchmarks: target detection within hours, containment within 24 hours. Full compliance dashboard tracking patching cadence, access-control reviews, encryption status, and tenant data isolation. The average healthcare data breach costs $10.93 million (IBM Security, 2023) — the board needs to understand that compliance investment is risk reduction with a quantifiable payoff.

Stage 4: Strategist — Series C+, 50+ Engineers

What to measure. Full DORA, SPACE, and DX Core 4 infrastructure. DX Core 4 combines one system metric (pull requests per engineer — top-quartile benchmark is 4.3 per week at small-to-medium tech companies), one perceptual metric (Developer Experience Index), one quality metric (change failure rate — benchmark range 2.8–4.6%), and one business-alignment metric (percentage of time on new capabilities versus maintenance).[15] Abi Noda’s benchmark: "For every one-point increase in DXI score, you save about 10 minutes per week per engineer."[15] At 50+ engineers, that is measurable in headcount-equivalent terms and becomes a compelling board narrative. Add financial metrics: R&D as a percentage of revenue (20–35% at this stage is typical; median at IPO is 23–26%), revenue per employee ($130,000+ for private SaaS, $283,000+ for public SaaS), and the Rule of 40 (growth rate + profit margin ≥ 40%).[8]
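Noda's DXI benchmark and the Rule of 40 both reduce to one-line arithmetic. A sketch with an illustrative team size and DXI gain (the 10-minutes-per-point figure is from the benchmark quoted above; the team size, point gain, and 40-hour week are assumptions made for the example):

```python
# Back-of-envelope translation of the DXI benchmark into headcount terms:
# one DXI point saves roughly 10 minutes per engineer per week.
engineers = 60      # illustrative team size
dxi_gain = 6        # illustrative DXI improvement, in points

minutes_saved = engineers * dxi_gain * 10   # per week, across the org
fte_equivalent = minutes_saved / 60 / 40    # assuming 40-hour weeks
print(f"{fte_equivalent:.1f} FTE-equivalents recovered per week")

# Rule of 40: growth rate plus profit margin should total at least 40.
def rule_of_40(growth_pct: float, margin_pct: float) -> bool:
    return growth_pct + margin_pct >= 40

# High growth can offset negative margins, and vice versa.
print(rule_of_40(growth_pct=55, margin_pct=-10))
```

Framing a developer-experience investment as recovered FTE-equivalents is the board narrative the paragraph above describes: the same number, translated into capacity.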

What not to measure. Individual productivity scores. McKinsey’s 2023 proposal to measure individual developer output provoked a unified backlash from Orosz, Beck, Noda, and Forsgren — the practitioners who created the frameworks McKinsey was citing.[10] The consensus was unambiguous: individual output metrics destroy the collaboration that makes teams productive. Tacho’s principle: "What I’m not measuring is also important. I’ve made a choice here not to directly measure things like PR count, commits, or even story points."[9]

Goal-setting. Full OKR infrastructure with business-outcome framing, or DIBB at scale. Board-level reporting that translates engineering metrics into the language of the Rule of 40 and capital efficiency.

Tools. Jellyfish, Faros AI, Cortex or CorralData for engineering intelligence and financial alignment. Full compliance platform. Custom executive dashboards. Cost: enterprise pricing.

Board reporting. The board at this stage is evaluating engineering as a financial function. Report: R&D efficiency (revenue per engineer, R&D as percentage of revenue), delivery health (DORA metrics in trend-line form, not single data points), team health (developer experience trend, attrition), and risk posture (compliance status, incident trends, security metrics). Fournier’s framing applies at every stage but is most critical here: the CTO who presents engineering as a cost centre will be managed as one. The CTO who presents engineering as a value-creation engine — with the metrics to prove it — will be treated as a strategic partner.

Healthcare addendum. Full compliance dashboard with board-level reporting. If pursuing FDA clearance for software as a medical device (SaMD), add regulatory submission metrics. Continuous monitoring across all SOC 2 trust service criteria. Breach notification readiness drills. Target mean time to restore critical systems under 24 hours — Censinet’s benchmark for healthcare organisations using automation is 4.2 hours.[19]

What to Avoid at Every Stage

Five measurement anti-patterns, each attributed to the practitioner who named it most precisely:

Individual productivity scoring. Larson: tools that measure individual developer output "harm the companies that use them."[2]

Lines of code, velocity points, or PRs as productivity proxies. Fowler: "Code that’s well designed and factored will be shorter because it eliminates the duplication. Copy and paste programming leads to high LOC counts and poor design."[6]

DORA clusters as fixed benchmarks. The 2025 report abandoned them. They were always statistical artefacts of annual survey data, not performance targets.[3]

Measurement disconnected from business outcomes. Larson: "Don’t ask your CEO to evaluate engineering through optimisation metrics."[2] If the metric cannot be translated into language the board understands, it is an internal diagnostic, not a reporting tool.

Gaming through honest pursuit. Wayne: "We have metrics in the first place because what we actually care about is nonquantifiable."[18] The team that optimises for the metric at the expense of the value the metric was supposed to represent has succeeded at measurement and failed at engineering.


This appendix provides the reference frame. Chapter 9 provides the argument for why these metrics matter and the communication language for presenting them. Chapter 6 provides the deployment infrastructure that makes the delivery metrics possible. Chapter 5 provides the debt framework that gives the maintenance-versus-new-capability ratio its meaning.


1. Reinhard, L. (2023, November). What engineering metrics should I use? A guide for engineering managers, directors, and VPs. lenareinhard.com. https://www.lenareinhard.com/articles/what-engineering-metrics-should-i-use-a-guide-for-engineering-managers-directors-and-vps — Reinhard was VP of Engineering at CircleCI and Travis CI.
2. Larson, W. (2023). Measuring an engineering organization. Irrational Exuberance (lethain.com) / The Engineering Executive’s Primer (O’Reilly Media). https://lethain.com/measuring-engineering-organizations/ — See also Larson, W. My skepticism towards current developer meta-productivity tools. https://lethain.com/developer-meta-productivity-tools/
3. Stephens, R. (2025, December 18). DORA 2025: Measuring software delivery after AI. RedMonk. https://redmonk.com/rstephens/2025/12/18/dora2025/ — The 2025 report replaced the four-tier model with seven behavioural archetypes based on eight measures.
4. Majors, C. (~2023). Interview with Charles Humble. Hacking the Org Podcast, Container Solutions. https://blog.container-solutions.com/charity-majors-on-code-rewrites-observability-and-team-performance — See also Majors, C. (2020, July 7). Questionable advice: Can engineering productivity be measured? charity.wtf. https://charity.wtf/2020/07/07/questionable-advice-can-engineering-productivity-be-measured/
5. LinearB. (2026). Software engineering benchmarks report. https://linearb.io/resources/software-engineering-benchmarks-report — Dataset: 8.1M+ pull requests across 4,800+ teams. Elite cycle time benchmark: <25 hours (P75).
6. Fowler, M. (2003, August 29). CannotMeasureProductivity. martinfowler.com/bliki. https://martinfowler.com/bliki/CannotMeasureProductivity.html
7. Bailey, D. (2019, April 22; updated 2024, April 9). How to implement OKRs in an early-stage company. dave-bailey.com. https://www.dave-bailey.com/blog/startup-okrs — "In a startup, there are only two things that matter: your customers and your team."
8. R&D-as-percentage-of-revenue benchmarks compiled from: Lemkin, J. (SaaStr, 2023) — "The average public SaaS company spends about 50% on sales/marketing and 20% on engineering/product." Bessemer Venture Partners, "Scaling to $100 Million." Blossom Street Ventures analysis of 75 SaaS IPOs (median R&D at IPO: 23–26%). McKinsey, "SaaS and the Rule of 40." https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/saas-and-the-rule-of-40-keys-to-the-critical-value-creation-metric
9. Tacho, L. (CTO, DX). On developer experience surveys: Tacho, L. A deep dive into developer experience surveys. lauratacho.com. https://lauratacho.com/blog/a-deep-dive-into-developer-experience-surveys — On stage-appropriate measurement: Orosz, G. and Tacho, L. (2022, July 5). Measuring software engineering productivity. The Pragmatic Engineer. https://newsletter.pragmaticengineer.com/p/engineering-productivity
10. Orosz, G. and Beck, K. (2023, November). Measuring developer productivity? A response to McKinsey, Part 2. The Pragmatic Engineer. https://newsletter.pragmaticengineer.com/p/measuring-developer-productivity-part-2
11. Kniberg, H. (2016, June 8). Spotify rhythm. blog.crisp.se. https://blog.crisp.se/2016/06/08/henrikkniberg/spotify-rhythm — DIBB: Data, Insight, Belief, Bet. Strategic cadence of six months, tactical cadence of six weeks.
12. Cagan, M. (2020, March 2). Team objectives — Empowerment. svpg.com. https://www.svpg.com/team-objectives-empowerment/
13. Tunguz, T. (2019, July 8). Adding engineering metrics to the Redpoint SaaS metrics template. tomtunguz.com. https://tomtunguz.com/engineering-metrics/ — Three recommended board metrics: reliability, availability, incident rate. "I wish there was a way to get a sense of engineering cadence or productivity, but that remains elusive."
14. Fournier, C. (2016, May 26). Ask the CTO: Measuring team performance. O’Reilly Radar. https://www.oreilly.com/radar/ask-the-cto-measuring-team-performance/ — Fournier was CTO of Rent the Runway.
15. DX Core 4 benchmarks: Noda, A. (2024). 2024 benchmarks for the DX Core 4. DX Newsletter (newsletter.getdx.com). https://newsletter.getdx.com/p/2024-benchmarks-for-the-dx-core-4 — See also: Noda, A., Storey, M.-A., Forsgren, N., & Greiler, M. (2023). DevEx: What actually drives productivity. ACM Queue. https://queue.acm.org/detail.cfm?id=3595878
16. Swarmia. Engineering benchmarks. swarmia.com. https://www.swarmia.com/blog/engineering-benchmarks/ — Balance Framework: 60% new things and improvements, 15% productivity, 10% keeping-the-lights-on.
17. Beck, K. (2024, January 11). Productivity measurement as a tradeoff. Tidy First? (Substack). https://tidyfirst.substack.com/p/productivity-measurement-as-a-tradeoff
18. Wayne, H. (2024, September 17). Goodhart’s Law in software engineering. buttondown.com. https://buttondown.com/hillelwayne/archive/goodharts-law-in-software-engineering/
19. Censinet. Healthcare cybersecurity benchmarking: Key metrics. censinet.com. https://censinet.com/perspectives/healthcare-cybersecurity-benchmarking-key-metrics — 4.2-hour average EHR restoration time via automation; 24-hour target for full restoration.