How Engineering Leaders Should Think About SRE

Site Reliability Engineering has become one of the most influential and most misunderstood disciplines in modern software engineering. Organisations hire SREs without understanding what they are hiring them to do. They implement SLOs without understanding what SLOs are supposed to change. They run error budget reviews without understanding why error budgets exist. The result is SRE theatre — the rituals of reliability engineering without the outcomes.

SRE Is a Philosophy, Not a Job Title

The most important thing engineering leaders can understand about SRE is that it is a philosophy of reliability engineering — a set of principles about how to build and operate reliable systems — not a job title or a team structure. The principles: define reliability in terms of the user experience using SLOs; use error budgets to make trade-off decisions explicit; eliminate toil through automation; treat operations with the same engineering rigour as product development. These principles can be applied by product engineering teams without a dedicated SRE function.

Why Most Companies Get SRE Wrong on Day One

The most common implementation mistake is treating SRE as a renamed operations team. The SRE team inherits the incident response, the change management, and the infrastructure maintenance responsibilities of the previous ops team — and is expected to simultaneously transform these into engineered, automated systems while running them manually. The result is a team that is too buried in toil to do the engineering work that would reduce toil. This is the SRE bootstrap paradox, and it is only resolved by giving the team explicit protection from toil — either time-boxing the operational work or transferring it to a separate function during the transition.

The SLO Contract: Reliability as a Business Conversation

Service Level Objectives transform reliability from a technical concern into a business conversation. Instead of discussing uptime percentages and p99 latencies, SLOs frame reliability in terms of the user experience that the business has committed to delivering. A 99.9% SLO for checkout completion rate means something to a product manager; a 200ms p95 latency target does not. This translation is one of the most valuable contributions SRE thinking makes to engineering organisations — it creates a shared language for reliability decisions between technical and non-technical stakeholders.
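
As a minimal sketch of the checkout example above, a completion-rate SLI can be computed and compared against its target in a few lines. The function names and the counts passed in are hypothetical, and treating a window with no checkout attempts as compliant is an assumption rather than a rule.

```python
def completion_rate(completed: int, attempted: int) -> float:
    """Fraction of checkout attempts that completed successfully."""
    if attempted == 0:
        # Assumption: no attempts means no user was failed in this window.
        return 1.0
    return completed / attempted


def meets_slo(completed: int, attempted: int, target: float = 0.999) -> bool:
    """True when the measured completion rate is at or above the SLO target."""
    return completion_rate(completed, attempted) >= target


if __name__ == "__main__":
    # Illustrative counts only, not real data.
    print(meets_slo(completed=99_950, attempted=100_000))
    print(meets_slo(completed=99_800, attempted=100_000))
```

The point of the sketch is that the number a product manager sees is a ratio of user outcomes, not an infrastructure measurement.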

Error Budgets: The Most Misunderstood Concept

Error budgets are the mechanism that makes SLOs actionable. If your SLO is 99.9% monthly availability, your error budget is the remaining 0.1%: roughly 43 minutes of downtime in a 30-day month if availability is measured by time, or one failed request in every thousand if it is measured by request volume. This budget can be spent on incidents, deployments, experiments, or maintenance. When the budget is exhausted, the service is less reliable than the business has committed to, and new feature releases should be paused until reliability is restored. This governance mechanism only works if leadership enforces it, and most organisations do not.
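
The arithmetic behind an error budget is simple enough to sketch. The example below assumes a 30-day window and a request-based way of tracking consumption; the class and method names are hypothetical, and the release-freeze rule is one possible policy rather than a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo: float        # SLO target as a fraction, e.g. 0.999 for 99.9%
    window_days: int  # length of the SLO window (assumed 30 here)

    @property
    def budget_fraction(self) -> float:
        """Share of time or requests that is allowed to fail."""
        return 1.0 - self.slo

    def allowed_downtime_minutes(self) -> float:
        """Time-based view: minutes of full downtime the budget permits."""
        return self.budget_fraction * self.window_days * 24 * 60

    def remaining_fraction(self, total_requests: int, failed_requests: int) -> float:
        """Request-based view: share of the budget still unspent.

        A negative result means the budget is overspent.
        """
        allowed_failures = self.budget_fraction * total_requests
        if allowed_failures == 0:
            return 0.0
        return 1.0 - failed_requests / allowed_failures

    def should_pause_releases(self, total_requests: int, failed_requests: int) -> bool:
        """Hypothetical policy: pause feature releases once the budget is spent."""
        return self.remaining_fraction(total_requests, failed_requests) <= 0.0


if __name__ == "__main__":
    budget = ErrorBudget(slo=0.999, window_days=30)
    print(round(budget.allowed_downtime_minutes(), 1))  # 43.2
    # Illustrative traffic numbers only.
    print(budget.should_pause_releases(total_requests=1_000_000, failed_requests=1_500))
```

Encoding the pause rule as a function makes the leadership decision explicit: the threshold is agreed in advance rather than argued during an incident.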

The Toil Problem: When SRE Teams Become Ops Teams

Toil is the operational work that is manual, repetitive, automatable, and does not produce lasting value. SRE teams accumulate toil the way codebases accumulate technical debt — gradually, without anyone making a deliberate decision that toil is acceptable. When toil consumes more than fifty percent of an SRE team’s time, the team has effectively become an operations team with better tooling. The SRE function has been lost. Preventing this requires explicit toil measurement, leadership commitment to toil reduction targets, and the organisational authority to push back on toil-generating requests.
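
Explicit toil measurement can start as nothing more than categorised time entries. The sketch below assumes each entry is tagged either "toil" or "engineering" (a hypothetical labelling scheme) and checks the team against the fifty percent line described above.

```python
from collections import defaultdict
from typing import Iterable, Tuple

TOIL_THRESHOLD = 0.5  # the fifty percent line from the text


def toil_fraction(entries: Iterable[Tuple[str, float]]) -> float:
    """Share of logged hours tagged as toil.

    `entries` is an iterable of (category, hours) pairs. The category
    labels are an assumption of this sketch, not an established taxonomy.
    """
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    total_hours = sum(totals.values())
    if total_hours == 0:
        return 0.0
    return totals["toil"] / total_hours


def exceeds_toil_threshold(entries: Iterable[Tuple[str, float]]) -> bool:
    """True when toil consumes more than the threshold share of team time."""
    return toil_fraction(entries) > TOIL_THRESHOLD


if __name__ == "__main__":
    # Illustrative week for one engineer, not real data.
    week = [("toil", 14.0), ("engineering", 18.0), ("toil", 8.0)]
    print(toil_fraction(week), exceeds_toil_threshold(week))
```

Even a rough measure like this gives leadership a number to set reduction targets against, which is what makes pushing back on toil-generating requests possible.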

Embedding SRE Thinking Without a Dedicated SRE Team

Not every organisation needs dedicated SREs. The principles of SRE — SLOs, error budgets, blameless post-mortems, toil elimination — can be embedded in product engineering teams without a separate function. This requires training, tooling, and a leadership commitment to the practice. It also requires a clear owner for each system’s reliability: a product team that owns its SLOs, measures its error budget, and has the authority to pause feature work when reliability is at risk. The practices without the ownership produce compliance without accountability.

When to Hire Dedicated SREs (and When Not To)

Dedicated SREs are most valuable in organisations that have multiple services with high reliability requirements, where on-call burden has grown beyond what product teams can sustainably absorb, and where the complexity of the reliability engineering work justifies specialisation. As a rule of thumb, below a threshold of roughly twenty to thirty services or five to ten engineers on call, the overhead of a dedicated SRE function typically exceeds the value it delivers. Above that threshold, the investment compounds.

Measuring SRE Success: The Metrics That Matter

SRE success is measured in outcome metrics: the reliability of user-facing services against their SLOs, the frequency and severity of incidents, the proportion of engineering time spent on toil versus reliability improvement, and the mean time to recover from incidents. Process metrics — SLO coverage, error budget reviews conducted, post-mortems completed — are leading indicators that the practice is being applied. They are not substitutes for outcome metrics.
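
One of these outcome metrics, mean time to recover, can be sketched from incident records. The example assumes each incident is stored as a pair of detection and resolution timestamps; that record shape and the function name are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Incident = Tuple[datetime, datetime]  # (detected_at, resolved_at)


def mean_time_to_recover(incidents: List[Incident]) -> Optional[timedelta]:
    """Average time from detection to resolution, or None with no incidents."""
    if not incidents:
        return None
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Illustrative incidents only, not real data.
    incidents = [
        (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 30)),
        (datetime(2024, 3, 9, 22, 0), datetime(2024, 3, 9, 23, 30)),
    ]
    print(mean_time_to_recover(incidents))  # 1:00:00
```

Returning None rather than zero for an empty list keeps a quiet month from being reported as instant recovery.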
