Reliability Is a Portfolio Decision

I once watched a quarterly roadmap get approved that assumed—without anyone saying it—three critical systems wouldn’t fail during the same two-week window. No one modeled it. No one raised it. It was just the bet the organization was making by not discussing it.
Every engineering organization carries reliability bets. Most haven’t written them down. The roadmap is visible. The risk allocation behind it isn’t.
I’ve worked inside enterprise platforms where that gap led to real costs—not from negligence but because the reliability conversation never reached planning. It appeared only in post-mortems after the negotiation failed.
The Binary Trap
Many organizations treat reliability as an all-or-nothing proposition: the system works or it’s down. That framing creates two pitfalls: over-investing everywhere, which drains budgets and slows delivery, or deferring investment until a public incident forces a response.
Southwest Airlines’ December 2022 meltdown was a portfolio allocation failure that manifested in technology. Their crew scheduling system had been flagged internally for years, but leadership kept choosing to invest elsewhere. The bet was implicit. When winter weather hit, the portfolio collapsed, and the DOT fined them while they absorbed nearly $1B in losses.
Tiered Allocation
The question isn’t whether to invest in reliability. It’s whether you’ve made that allocation explicit or let it follow whoever last controlled the budget and whatever last failed.
I frame this in three tiers. Not as an SRE framework, but as a planning lens.
Structural systems are those where failure isn’t just an incident. It’s existential. Payment processing, authentication, and data integrity qualify here. These receive deep investment: redundancy, automated failover, rigorous testing. You don’t negotiate these down. In regulated environments, compliance typically makes systems structural by default. If data is financial or PII, regulatory overhead supersedes what engineering might otherwise decide.
Elastic systems tolerate some degradation when managed well. Search can be briefly stale, a recommendation engine can return defaults, and a notification can be delayed. Invest so these fail gracefully but not so much that they never fail. Design the degradation path rather than hoping one appears.
Disposable systems are those where failure is cheap and recovery is fast. Internal tooling, experiments behind feature flags, and batch jobs that can be re-run all fit here. By design, these get minimal reliability investment, which is appropriate if deliberate.
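The “design the degradation path” advice for elastic systems can be sketched concretely. In this hypothetical example (the function names, fallback content, and failure mode are all assumptions, not from the text), a personalized recommendation call falls back to safe defaults instead of failing the request:

```python
DEFAULT_RECOMMENDATIONS = ["bestsellers", "staff-picks"]  # safe, pre-approved fallback

def fetch_recommendations(user_id: str) -> list[str]:
    """Hypothetical personalized call that may be slow or unavailable."""
    raise TimeoutError("recommendation engine unavailable")  # simulate an outage

def recommendations_with_fallback(user_id: str) -> list[str]:
    # The degradation path is designed, not hoped for: on any failure,
    # serve defaults so the page still renders.
    try:
        return fetch_recommendations(user_id)
    except Exception:
        return DEFAULT_RECOMMENDATIONS

print(recommendations_with_fallback("u-123"))  # ['bestsellers', 'staff-picks']
```

The point isn’t the try/except; it’s that someone decided in advance what the degraded response should be.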
These tiers classify systems, not teams, and they’re dynamic. Major reliability failures often occur when a system quietly shifts tiers. A tool built on a Friday gets adopted by customer support, baked into a daily workflow, and becomes load-bearing without engineering ever knowing. When it breaks, you discover it was structural all along. Maintaining the portfolio means recognizing when a system has drifted before an incident proves it.
Many organizations struggle here: they invest as if everything is structural but treat everything as disposable when budgets shrink. Since the portfolio remains implicit, risk allocation is set by whoever is most afraid—not by whoever has the best information.
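Catching tier drift before an incident proves it can start with something as blunt as comparing a system’s declared tier to an observed adoption signal. A minimal sketch, using the tier names from the text; the example system, the consumer-count signal, and the threshold are all invented:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    STRUCTURAL = "structural"   # failure is existential; deep investment
    ELASTIC = "elastic"         # tolerates managed degradation
    DISPOSABLE = "disposable"   # cheap to fail, fast to recover

@dataclass
class SystemEntry:
    name: str
    declared_tier: Tier
    downstream_consumers: int   # rough adoption signal; tends to grow quietly

def has_drifted(entry: SystemEntry, threshold: int = 5) -> bool:
    """Flag a 'disposable' system that has quietly become load-bearing."""
    return (entry.declared_tier is Tier.DISPOSABLE
            and entry.downstream_consumers >= threshold)

# A Friday tool now baked into customer support's daily workflow:
tool = SystemEntry("support-csv-export", Tier.DISPOSABLE, downstream_consumers=12)
print(has_drifted(tool))  # True
```

Any real signal (dependency graphs, call volume, on-call pages) would do; the value is in reviewing the comparison regularly rather than discovering it in a post-mortem.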
Operational Readiness
I learned this lens long before software. In the Army, operational readiness isn’t asking, “Are we ready for anything?” It’s, “What are we prepared for, what aren’t we, and does leadership understand?” Units report status across personnel, equipment, and training. Commanders deploy based on known gaps rather than a single green, yellow, or red. You’d never send a unit downrange without knowing where it stands.
Yet many engineering organizations approve quarterly roadmaps without this conversation about systems—not out of disregard, but because no one built the mechanism.
Making It Legible
The mechanism doesn’t need to be complex. It needs to translate reliability into the language the business already uses.
Revenue protection: What do you risk if this system fails during peak? If the payment gateway goes down during holiday traffic, the exposure isn’t theoretical: it’s dollars per minute in lost transactions.
Recovery cost: What does it take in engineering time, customer goodwill, and contractual standing to recover? A four-hour outage on a structural system can cost a weekend of work, a flood of support tickets, and a difficult conversation with a client whose SLA you just missed.
Opportunity cost: What won’t get built while keeping legacy systems alive? If a team spends 30% of its capacity nursing a service that should have been replaced two quarters ago, that’s not a maintenance line item. It’s a feature that never shipped.
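The three lenses above reduce to arithmetic simple enough to put in a planning doc. A sketch, where every rate and dollar figure is a hypothetical input, not data from the text:

```python
# Revenue protection: exposure if the payment gateway fails during peak.
transactions_per_minute = 400        # hypothetical peak rate
avg_transaction_value = 75.0         # hypothetical, in dollars
outage_minutes = 240                 # the four-hour outage from the example
revenue_exposure = transactions_per_minute * avg_transaction_value * outage_minutes

# Opportunity cost: capacity consumed nursing a legacy service.
team_capacity_hours = 6 * 40 * 13    # six engineers, one quarter (hypothetical)
legacy_maintenance_share = 0.30      # the 30% figure from the example
feature_hours_lost = team_capacity_hours * legacy_maintenance_share

print(f"Revenue exposure: ${revenue_exposure:,.0f}")          # $7,200,000
print(f"Feature hours lost per quarter: {feature_hours_lost:,.0f}")  # 936
```

The numbers are invented, but the shape is the argument: a four-hour peak outage and a 30% maintenance tax both convert directly into figures a business review can weigh.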
When I frame reliability in these business terms instead of latency or error budgets, the conversation shifts. You’re no longer pitching uptime to people who assume uptime is free. You’re pitching risk-adjusted investment to people who think in portfolios.
The artifact that makes this real is a simple one-page summary with four columns: system, tier, exposure, and the cost to close the gap. It’s something you bring to a quarterly business review, not a dashboard engineers stare at in isolation. If it doesn’t fit on one page, you’re overcomplicating it.
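That one-page artifact can be sketched as plain data. Every system name, exposure estimate, and cost below is an invented placeholder:

```python
# Columns per the text: system, tier, exposure, cost to close the gap.
portfolio = [
    ("payment-gateway", "structural", "$30k/min lost transactions", "funded"),
    ("search",          "elastic",    "stale results, support load", "2 eng-weeks"),
    ("csv-export",      "disposable", "manual re-run, ~1 hour",      "none"),
]

header = ("System", "Tier", "Exposure", "Cost to close gap")
rows = [header] + portfolio
widths = [max(len(row[i]) for row in rows) for i in range(4)]
for row in rows:
    print("  ".join(cell.ljust(w) for cell, w in zip(row, widths)))
```

A spreadsheet does the same job; what matters is that the four columns exist and get reviewed.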
The Bet You’re Already Making
Your organization already has a reliability portfolio. It’s either explicit—documented, defended, reviewed—or implicit, shaped by inertia, recency bias, and whoever made the most noise last quarter.
Making the portfolio explicit won’t prevent all failures. But when something breaks, the organization learns from documented decisions instead of excavating Slack threads to figure out who decided this system didn’t matter.