
Metrics That Lie: What Engineering Leaders Measure Wrong

MindZBASE Engineering Team · 11 min read

The Measurement Trap in Engineering Leadership

Engineering leaders face a measurement problem that doesn't exist in most other disciplines. In sales, revenue is unambiguous. In finance, margin is directly observable. In engineering, the thing you actually care about — the sustained ability to deliver working software that creates business value — is maddeningly difficult to quantify. So leaders reach for proxies: velocity, commit counts, pull request throughput, ticket closure rates. The proxies are easy to measure. They are also routinely misleading.

The problem is not that measurement is bad. Measurement is essential for engineering organisations above a certain size — without it, leaders are flying blind and teams have no shared signal for whether they're improving or degrading. The problem is that the most commonly used engineering metrics are measuring the wrong things, at the wrong level of abstraction, with the wrong consequences when people optimise for them. They create a false picture of health that persists until something breaks badly enough to contradict it.

This post is a map of the most common measurement traps in engineering leadership, and a framework for building a metrics system that actually reflects the health of the organisation rather than the skill of the teams at gaming the measurement system. The goal is not to find perfect metrics — no such metrics exist. It is to find a combination of signals that are difficult to simultaneously game and that together give a more honest picture than any single metric can.

Velocity: The Most Misused Metric in Software Development

Sprint velocity is the single most widely used and most widely misused engineering metric. Used correctly — as an internal team signal for capacity planning — it is genuinely useful. Teams can look at their recent velocity and make reasonable estimates about how much they can take on in the next sprint. The signal is meaningful within a team, across time, for that specific purpose.

Used incorrectly — as an inter-team comparison metric, or as a performance signal for individual engineers, or as a KPI reported to executives — velocity produces perverse incentives almost immediately. Story point inflation is the most common response: when engineers know their velocity is being evaluated externally, estimates drift upward. A task that would have been estimated at a 3 in a healthy environment gets estimated at a 5. The team's velocity appears to grow. Nothing about the actual delivery rate has changed.
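The inflation pattern is detectable from the team's own data: if points per completed ticket drift upward while ticket throughput stays flat, a rising velocity number is more likely estimate drift than a real delivery change. A minimal sketch of that check, with all sprint figures invented for illustration:

```python
from statistics import median

# Hypothetical sprint data: tickets completed and story points claimed.
# All numbers are illustrative, not drawn from any real team.
sprints = [
    {"tickets": 14, "points": 42},  # early sprint: ~3 points per ticket
    {"tickets": 13, "points": 44},
    {"tickets": 14, "points": 58},  # velocity "grows"...
    {"tickets": 13, "points": 65},  # ...but ticket throughput is flat
]

def inflation_signal(sprints):
    """Flag when points-per-ticket drifts up while ticket throughput does not.

    Rising velocity with flat throughput is consistent with estimate
    inflation rather than a genuine delivery improvement.
    """
    ppt = [s["points"] / s["tickets"] for s in sprints]
    half = len(sprints) // 2
    early, late = ppt[:half], ppt[half:]
    tickets_early = sum(s["tickets"] for s in sprints[:half]) / half
    tickets_late = sum(s["tickets"] for s in sprints[half:]) / (len(sprints) - half)
    drift = (median(late) / median(early)) - 1.0          # points-per-ticket growth
    throughput_change = (tickets_late / tickets_early) - 1.0
    return drift > 0.20 and abs(throughput_change) < 0.10

print(inflation_signal(sprints))  # True: points/ticket rose ~43%, throughput flat
```

The thresholds (20% drift, 10% throughput tolerance) are arbitrary starting points; the value of the check is in prompting a conversation, not in the exact cutoffs.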

The deeper problem is that velocity measures output, not outcome. A team that completes fifty story points of work in a sprint has done a measurable amount of activity. Whether that activity created any business value, reduced any technical risk, or contributed meaningfully to the product's quality is entirely invisible in the velocity number. A team completing twenty story points of strategically important, technically excellent work is more valuable than a team completing sixty story points of low-value, poorly executed tickets — but the velocity metric will consistently favour the latter.

Commit Count, Lines of Code, and Other Vanity Metrics

There is a category of engineering metrics that were invented to give non-technical stakeholders a feeling of visibility into engineering productivity. Commit count per engineer per week. Lines of code written per day. Pull request merge frequency. These metrics share a defining characteristic: they are easy to collect, easy to display on a dashboard, and deeply misleading as indicators of engineering quality or effectiveness.

Lines of code, in particular, is one of the most dangerous metrics in software development. More code is not better code. The engineers who make the most impactful contributions often do so by deleting code — removing duplication, simplifying an overcomplicated abstraction, eliminating a feature that was never used. A week in which a senior engineer deletes three thousand lines and writes five hundred high-quality replacements is a better engineering week than one in which a junior engineer writes four thousand lines of copy-pasted logic. The lines-of-code metric reports the opposite.

Commit count has similar problems. Splitting work into small, frequent commits is a good engineering practice — it makes code review easier, makes rollbacks simpler, and produces a clearer history. But if commit count is being measured as a proxy for productivity, engineers will split work artificially to inflate the number. The commit count rises. The code quality does not. Vanity metrics are not just useless — they actively misdirect the engineering culture in ways that take years to unwind.

How DORA Metrics Get Gamed

The DORA metrics — Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Restore — represent the most rigorously validated framework for engineering performance measurement available. They emerged from years of research across thousands of engineering teams and have strong predictive validity for organisational performance. They are also, like every metric system, gameable once organisations start treating them as targets rather than diagnostic signals.

Deployment frequency is the easiest to game: if you count every deployment to every environment, including development and staging, the number rises without any improvement in production delivery. Some teams count each microservice deployment separately, inflating frequency metrics without any corresponding change in the rate at which user-facing value is delivered. Lead time for changes can be improved on paper by routing only the smallest, fastest changes through the measured pipeline while letting complex changes age outside it. Change failure rate can be gamed by reclassifying what counts as a failure — if only P0 incidents trigger the metric, teams have a strong incentive to classify borderline incidents as P1 rather than P0.
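The deployment-frequency gaming described above becomes visible the moment you separate environments in the count. A sketch of the gameable tally versus the honest one, using an invented deploy log (the field layout and values are assumptions, not any real pipeline):

```python
from collections import Counter

# Hypothetical deploy log: (service, environment, iso_week). Illustrative only.
deploys = [
    ("checkout", "staging",    "2024-W10"),
    ("checkout", "production", "2024-W10"),
    ("search",   "dev",        "2024-W10"),
    ("search",   "production", "2024-W10"),
    ("checkout", "staging",    "2024-W11"),
    ("checkout", "production", "2024-W11"),
]

def naive_frequency(deploys):
    # The gameable version: every environment counts toward the number.
    return Counter(week for _, _, week in deploys)

def production_frequency(deploys):
    # The honest version: only production deploys count.
    return Counter(week for _, env, week in deploys if env == "production")

print(naive_frequency(deploys))       # every push to any environment inflates it
print(production_frequency(deploys))  # what users actually received
```

In this toy log the naive count doubles the production count for week W10; the same divergence in a real dashboard is a prompt to check what the frequency metric is actually counting.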

None of this means DORA metrics are bad. They are the best validated framework available. The implication is that DORA metrics work as diagnostic tools — used to identify systemic problems and track the impact of process changes — but break down as performance management tools the moment teams feel their careers depend on the numbers. The metrics should be used by leaders to interrogate the engineering system, not as report cards for individual teams.

The Metrics That Actually Predict Engineering Health

The most reliable indicators of engineering health are harder to collect than velocity or DORA scores, but they are much harder to game because they measure outcomes that engineers cannot move simply by optimising their own behaviour in isolation. Customer-reported defect rate over time is one such indicator — it requires real users encountering real problems, and it is difficult to manipulate without improving the actual product. Unplanned work as a percentage of total engineering capacity is another: organisations with high unplanned work rates have systemic reliability or process problems that show up in the metric regardless of how well-formatted their sprint velocity charts look.
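Unplanned-work share is straightforward to compute from a ticket export, provided tickets are tagged by origin. A minimal sketch; the field names and origin categories here are hypothetical:

```python
# Hypothetical ticket export: each ticket carries its origin and effort in
# engineer-days. Field names and categories are assumptions for illustration.
tickets = [
    {"origin": "roadmap",    "days": 5},
    {"origin": "roadmap",    "days": 8},
    {"origin": "incident",   "days": 3},   # unplanned
    {"origin": "hotfix",     "days": 2},   # unplanned
    {"origin": "roadmap",    "days": 6},
    {"origin": "escalation", "days": 4},   # unplanned
]

UNPLANNED = {"incident", "hotfix", "escalation"}

def unplanned_share(tickets):
    """Fraction of total engineering effort consumed by reactive work."""
    total = sum(t["days"] for t in tickets)
    unplanned = sum(t["days"] for t in tickets if t["origin"] in UNPLANNED)
    return unplanned / total

print(f"{unplanned_share(tickets):.0%}")  # roughly a third of capacity is reactive
```

Weighting by effort rather than ticket count matters: ten five-minute hotfixes distort a count-based share far more than they distort actual capacity.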

Mean time to detect production issues — not just mean time to resolve — is a particularly powerful signal. Teams with excellent observability and monitoring detect issues within minutes. Teams with weak instrumentation discover them when a customer complains hours or days later. The difference between these two states represents an enormous gap in operational maturity that is invisible in most standard engineering dashboards. Similarly, on-call burden — specifically, how often on-call engineers are paged outside business hours and how complex those pages are to resolve — is a leading indicator of system reliability and technical debt accumulation that rarely appears in the metrics packages leadership reviews.
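MTTD and MTTR fall directly out of incident timestamps, as long as the start of impact is recorded separately from first detection. A sketch with invented incident records; note how a single customer-reported incident dominates the detection average:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; all timestamps are illustrative.
incidents = [
    {"started":  "2024-03-01T02:00", "detected": "2024-03-01T02:04",
     "resolved": "2024-03-01T03:00"},
    {"started":  "2024-03-09T14:00", "detected": "2024-03-09T14:02",
     "resolved": "2024-03-09T14:40"},
    {"started":  "2024-03-20T09:00", "detected": "2024-03-20T17:00",  # found by a customer
     "resolved": "2024-03-20T18:00"},
]

FMT = "%Y-%m-%dT%H:%M"

def _minutes(a, b):
    return (datetime.strptime(b, FMT) - datetime.strptime(a, FMT)).total_seconds() / 60

def mttd(incidents):
    """Mean time to detect: start of impact to first detection."""
    return mean(_minutes(i["started"], i["detected"]) for i in incidents)

def mttr(incidents):
    """Mean time to restore: start of impact to resolution."""
    return mean(_minutes(i["started"], i["resolved"]) for i in incidents)

print(f"MTTD: {mttd(incidents):.0f} min")  # 162 min, dominated by the third incident
print(f"MTTR: {mttr(incidents):.0f} min")
```

Two incidents were detected within minutes, yet the average is hours — which is itself the signal: a long tail of customer-discovered incidents is exactly the observability gap the paragraph above describes.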

Employee-reported metrics, collected through regular, anonymous surveys, round out this picture. Questions about whether engineers have what they need to do their best work, whether they feel proud of the quality of what they ship, and whether they find the engineering environment improving or degrading are leading indicators of retention and output quality; financial and delivery metrics typically lag these signals by six to twelve months.

Leading vs Lagging Indicators in Engineering

Most engineering metrics that leadership actually reviews are lagging indicators. Velocity tells you what a team completed in the past sprint. DORA metrics tell you how the team performed over the past quarter. Defect rates tell you about quality problems that have already reached customers. These are all useful signals — but by the time they show a problem, the problem has already been developing for weeks or months. Leading indicators allow leaders to intervene before problems fully materialise.

In engineering, leading indicators include: the rate at which technical debt items are being created versus retired (teams that are creating debt faster than they're retiring it are moving toward a future slowdown), the depth and health of the backlog refinement process (teams with poorly refined backlogs will experience planning failures before they experience delivery failures), and the trend in code review turnaround time (lengthening review cycles are often an early signal of team strain or process breakdown before those problems show up in velocity).
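A trend like lengthening review turnaround can be quantified with nothing more than a least-squares slope over weekly medians. A sketch, with hypothetical turnaround data:

```python
# Hypothetical weekly median code-review turnaround, in hours. A simple
# ordinary-least-squares slope flags a lengthening trend before it shows
# up in velocity or delivery metrics.
turnaround_hours = [6.0, 6.5, 7.0, 8.5, 9.0, 11.0]

def slope(ys):
    """OLS slope of ys against the implicit x values 0..n-1."""
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

trend = slope(turnaround_hours)
print(f"+{trend:.1f} h/week")  # review cycles lengthening ~1 hour per week
if trend > 0.5:  # arbitrary illustrative threshold
    print("leading signal: investigate team strain or review bottlenecks")
```

As the next paragraph notes, the slope alone does not say why review cycles are lengthening; it only tells you where to start asking questions.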

The challenge with leading indicators is that they require more interpretive skill and contextual knowledge to use correctly. A lengthening code review cycle might signal team strain, or it might signal that the team is being appropriately thorough on a particularly complex codebase change. No metric removes the need for judgment — but a thoughtful combination of leading and lagging indicators gives leaders a much richer diagnostic picture than either set alone.

How to Build a Metrics System That Can't Be Easily Gamed

A metrics system that resists gaming has three structural properties. First, it includes metrics at multiple levels of abstraction — input metrics (what the team is doing), output metrics (what the team is delivering), and outcome metrics (what impact the delivery is having on users and the business). Gaming one level is difficult when the other levels send a contradictory signal. A team can inflate their velocity number, but they cannot simultaneously inflate their customer satisfaction scores and deflate their defect rate.
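The cross-checking idea can be sketched as a simple contradiction test across levels: output rising while outcomes stay flat or worsen is a flag to investigate, not a win. The thresholds and field names below are illustrative assumptions:

```python
# Hypothetical quarter-over-quarter changes at two levels of abstraction.
# Values are invented; positive means the raw number went up.
signals = {
    "velocity_change":    +0.40,  # output: velocity up 40%
    "defect_rate_change": +0.15,  # outcome: defects also up
    "csat_change":        -0.05,  # outcome: satisfaction slightly down
}

def contradiction(signals):
    """Rising output with flat-or-worse outcomes is a red flag, not a win."""
    output_up = signals["velocity_change"] > 0.15       # illustrative threshold
    outcomes_worse = (signals["defect_rate_change"] > 0
                      or signals["csat_change"] < 0)
    return output_up and outcomes_worse

print(contradiction(signals))  # True: the velocity gain is not corroborated
```

The point of a check like this is diagnostic, in the sense the next paragraph describes: it directs a conversation with the team, not a judgment of it.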

Second, it uses metrics for diagnosis rather than evaluation. The difference matters enormously for team behaviour. When metrics are used diagnostically — to identify problems and guide improvement — engineers have no incentive to game them because good metrics lead to interventions that make their lives easier. When metrics are used evaluatively — to judge individual or team performance — engineers optimise for the metric rather than the thing it was designed to measure.

Third, it changes regularly in response to what the organisation has learned. No metrics system designed at one point in time remains optimal indefinitely. As the team's maturity evolves, as the product's complexity changes, and as the business's strategic priorities shift, the metrics that best capture engineering health will shift too. Engineering leaders who treat their metrics system as a living artefact — reviewed and updated quarterly — build more honest measurement cultures than those who set metrics once and treat the dashboard as permanent infrastructure.

The Human Metrics: What Numbers Can't Capture

Every quantitative metrics system in engineering leadership operates alongside an invisible qualitative layer that the numbers cannot capture but that dominates actual engineering outcomes. The best engineering leader diagnostic tool is still conversation — regular, honest, psychologically safe dialogue with engineers at multiple levels of the organisation. The things that emerge from those conversations — the problems with a specific architecture decision, the interpersonal friction on a particular team, the growing sense that the codebase is becoming unmanageable — are the leading signals that show up in quantitative metrics only months later.

Staff retention and voluntary attrition are the most consequential human metrics, and they resist any simple quantitative analysis. When a senior engineer leaves, the loss in institutional knowledge, mentoring capacity, and architectural judgment is real but invisible in any standard engineering dashboard. A team that loses two senior engineers in a quarter and replaces them with two mid-level hires will show identical headcount — and dramatically different capability. Leaders who rely exclusively on headcount as a capacity metric will consistently underestimate the disruption cost of senior attrition.

The honest conclusion for any engineering leader building a measurement system is that the quantitative and qualitative layers are not substitutes for each other — they are complements. Numbers give you scale and trend. Conversations give you causality and context. Leaders who invest in both, and who are humble about what each can and cannot tell them, make better decisions than those who mistake a well-designed dashboard for a complete picture of their engineering organisation.

Are Your Engineering Metrics Telling the Truth?

MindZBASE works with CTOs and engineering leaders to audit their measurement systems, identify the metrics creating perverse incentives, and build dashboards that reflect actual engineering health. Let's review what you're measuring together.

Schedule a Consultation